How do I coerce a tibble dataframe column from double to time?

I'm in the tidyverse.
I read in several CSV files using read_csv (all have the same columns)
df <- read_csv("data.csv")
to obtain a series of dataframes. After a bunch of data cleaning and calculations, I want to merge all the dataframes.
There are a dozen dataframes of several hundred rows each, and a few dozen columns. A minimal example is
DF1
ID name costcentre start stop date
<chr> <chr> <chr> <time> <tim> <chr>
1 R_3PMr4GblKPV~ Geo Prizm 01:00 03:00 25/12/2019
2 R_s6IDep6ZLpY~ Chevy Malibu NA NA NA
3 R_238DgbfO0hI~ Toyota Corolla 08:00 11:00 25/12/2019
DF2
ID name costcentre start stop date
<chr> <chr> <chr> <lgl> <time> <chr>
1 R_3PMr4GblKPV1OYd Geo Prizm NA NA NA
2 R_s6IDep6ZLpYvUeR Chevy Malibu NA 03:00 12/12/2019
3 R_238DgbfO0hItPxZ Toyota Corolla NA NA NA
Based on my cleaning requirements (if start is NA and stop is not NA), some of the NAs in start must be 00:00. I can enter a zero in that cell:
df <- within(df, start[is.na(df$start) & !is.na(df$stop)] <- 0)
This results in
DF1
ID name costcentre start stop date
<chr> <chr> <chr> <time> <tim> <chr>
1 R_3PMr4GblKPV~ Geo Prizm 01:00 03:00 25/12/2019
2 R_s6IDep6ZLpY~ Chevy Malibu NA NA NA
3 R_238DgbfO0hI~ Toyota Corolla 08:00 11:00 25/12/2019
DF2
ID name costcentre start stop date
<chr> <chr> <chr> <dbl> <time> <chr>
1 R_3PMr4GblKPV1OYd Geo Prizm NA NA NA
2 R_s6IDep6ZLpYvUeR Chevy Malibu 0 03:00 12/12/2019
3 R_238DgbfO0hItPxZ Toyota Corolla NA NA NA
I run into issues on merging, because the type of start varies: sometimes it is a double (where I've done some replacements), sometimes logical (where the column was all NAs with no replacements), and sometimes time (where the original data contained some times).
merged_df <- bind_rows(DF1, DF2,...)
gives me the error Error: Column `start` can't be converted from hms, difftime to numeric
How do I coerce the start column to be of the type time so that I may merge my data?

I think the important point is that the columns start and stop, which appear to be of type time, are based on the hms package. I wondered why/when <time> is displayed, because I had not heard about this class before.
As I see it, these columns are actually of class hms and difftime. Such objects are stored not in minutes (as the printed tibble suggests) but in seconds. We can see this by looking at the data via View(df). Interestingly, when the tibble is printed, the variable type is displayed as time.
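A quick check illustrates this (a minimal sketch):
library(hms)
x <- as_hms("01:00:00")
class(x)    # "hms" "difftime"
unclass(x)  # 3600 -- the underlying value is stored in seconds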
To solve your problem, you have to convert all your start and stop columns consistently into hms (difftime) columns, as in the example below.
Minimal reproducible example:
library(dplyr)
library(hms)
df1 <- tibble(id = 1:3,
              start = as_hms(as.difftime(c(1*60, NA, 8*60), units = "mins")),
              stop  = as_hms(as.difftime(c(3*60, NA, 11*60), units = "mins")))
df2 <- tibble(id = 4:6,
              start = c(NA, NA, NA),
              stop  = as_hms(as.difftime(c(NA, 3*60, NA), units = "mins")))
Or even easier (but with slightly different printing than in the question):
df1 <- tibble(id = 1:3,
              start = as_hms(c(1*60, NA, 8*60)),
              stop  = as_hms(c(3*60, NA, 11*60)))
df2 <- tibble(id = 4:6,
              start = c(NA, NA, NA),
              stop  = as_hms(c(NA, 3*60, NA)))
Solving the problem:
class(df1$start) # In df1 start has class hms and difftime
class(df2$start) # In df2 start has class logical
# We set start=0 if stop is not missing and turn the whole column into an hms object
df2 <- df2 %>% mutate(start = new_hms(ifelse(!is.na(stop), 0, NA)))
# Now that column types are consistent across tibbles we can easily bind them together
df <- bind_rows(df1, df2)
df
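Applied back to the original question, the same idea can replace the within() call; a hedged sketch, assuming start may be logical, double, or hms depending on the file:
df <- df %>%
  mutate(start = new_hms(ifelse(is.na(start) & !is.na(stop), 0,
                                as.numeric(start))))
Here as.numeric() maps an existing hms column to seconds (and an all-NA logical column to NA), so new_hms() rebuilds a consistent hms column in every dataframe before binding.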

Related

Aggregate week and date in R by some specific rules

I'm not used to using R. I already asked a question on Stack Overflow and got a great answer.
I'm sorry to post a similar question, but I tried many times and got output that I didn't expect.
This time, I want to do something slightly different from my previous question:
Merge two data with respect to date and week using R
I have two data frames. One has a year_month_week column and the other has a date column.
df1 <- data.frame(id = c(1,1,1,2,2,2,2),
                  year_month_week = c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
                  points = c(65,58,47,21,25,27,43))
df2 <- data.frame(id = c(1,1,1,2,2,2),
                  date = c(20220503,20220506,20220512,20220401,20220408,20220409),
                  temperature = c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May, 2022. Likewise, 2022052 means the 2nd week of May, 2022. For df2, 20220503 means May 3rd, 2022. What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 are in the 1st week of May, 2022. If more than one date falls in a year_month_week, I will just include the first of them. Now, here's the different part: even if there is no date inside a year_month_week, just leave it NA. So my expected output has the same number of rows as df1, which includes the column year_month_week. My expected output is as follows:
df <- data.frame(id = c(1,1,1,2,2,2,2),
                 year_month_week = c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
                 points = c(65,58,47,21,25,27,43),
                 temperature = c(36.1,36.6,NA,34.3,34.9,NA,NA))
First we can convert the dates in df2 into year_month_week format, then join the two tables:
library(dplyr);library(lubridate)
df2$dt = ymd(df2$date)
df2$wk = day(df2$dt) %/% 7 + 1
df2$year_month_week = as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))
df1 %>%
  left_join(df2 %>% group_by(year_month_week) %>% slice(1) %>%
              select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
id year_month_week points temperature
1 1 2022051 65 36.1
2 1 2022052 58 36.6
3 1 2022053 47 NA
4 2 2022041 21 34.3
5 2 2022042 25 34.9
6 2 2022043 27 NA
7 2 2022044 43 NA
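As a quick sanity check of the week-of-month formula used for the join key (a sketch, with lubridate loaded):
library(lubridate)
day(ymd(20220506)) %/% 7 + 1  # 1, so 20220506 maps to 2022051
day(ymd(20220512)) %/% 7 + 1  # 2, so 20220512 maps to 2022052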
You can build off of a previous answer here by taking the function that counts the week of the month, then generating a join key in df2. See here
df1 <- data.frame(
  id = c(1,1,1,2,2,2,2),
  year_month_week = c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
  points = c(65,58,47,21,25,27,43))
df2 <- data.frame(
  id = c(1,1,1,2,2,2),
  date = c(20220503,20220506,20220512,20220401,20220408,20220409),
  temperature = c(36.1,36.3,36.6,34.3,34.9,35.3))
# Take the week-of-month function from the previous StackOverflow question
monthweeks.Date <- function(x) {
  ceiling(as.numeric(format(x, "%d")) / 7)
}
# Create a year_month_week variable to join on
df2 <-
  df2 %>%
  mutate(
    date = lubridate::parse_date_time(
      x = date,
      orders = "%Y%m%d"),
    year_month_week = paste0(
      lubridate::year(date),
      # zero-pad the month so double-digit months also produce valid keys
      sprintf("%02d", lubridate::month(date)),
      monthweeks.Date(date)),
    year_month_week = as.double(year_month_week))
# Remove duplicate year_month_weeks
df2 <-
  df2 %>%
  arrange(year_month_week) %>%
  distinct(year_month_week, .keep_all = TRUE)
# Join dataframes
df1 <-
  left_join(
    df1,
    df2,
    by = "year_month_week")
Produces this result
id.x year_month_week points id.y date temperature
1 1 2022051 65 1 2022-05-03 36.1
2 1 2022052 58 1 2022-05-12 36.6
3 1 2022053 47 NA <NA> NA
4 2 2022041 21 2 2022-04-01 34.3
5 2 2022042 25 2 2022-04-08 34.9
6 2 2022043 27 NA <NA> NA
7 2 2022044 43 NA <NA> NA
Edit: forgot to mention that you need tidyverse loaded
library(tidyverse)
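Note that the two answers use slightly different week-of-month conventions; a quick comparison for day 7 of a month:
7 %/% 7 + 1     # 2: the first answer puts day 7 in week 2
ceiling(7 / 7)  # 1: this answer puts day 7 in week 1
For the example dates both formulas produce the same keys, but they can differ for days 7, 14, 21, and 28.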

If event 1 occurs, how often does event 2 AND event 3 co-occur?

I have a heating system with three thermostats. If my first thermostat gets too hot, it may cause my second to get too hot (although therm 2 could get too hot from other sources); if my second gets too hot, it may cause my third to get too hot. What I would like to know is: if my first thermostat gets too hot (recorded as an Event with a date_start and date_end), how often do events in my second AND third thermostats co-occur (what I'm calling a triple whammy event)?
I would define a triple whammy event as follows: the date_start of Therm2 AND Therm3 would have to occur between the date_start and date_end of Therm1.
> df1$Therm1
date_start date_end Event Site
1 2002-04-12 2002-04-21 1 Therm1
2 2002-06-26 2002-07-05 2 Therm1
3 2002-08-15 2002-08-20 3 Therm1
4 2005-08-08 2005-08-19 4 Therm1
> df2$Therm2
date_start date_end Event Site
1 2002-04-13 2002-04-19 1 Therm2
2 2002-08-11 2002-08-19 2 Therm2
3 2005-06-09 2005-06-14 3 Therm2
4 2005-08-10 2005-08-14 4 Therm2
> df3$Therm3
date_start date_end Event Site
1 2002-04-14 2002-04-19 1 Therm3
2 2002-08-11 2002-08-19 2 Therm3
3 2005-06-09 2005-06-14 3 Therm3
4 2005-08-10 2005-08-14 4 Therm3
In this example, a triple whammy event occurs during Events 1 and 4 of df1$Therm1, because the date_start in df2$Therm2 AND df3$Therm3 occurs between the date_start and date_end of those Events in df1$Therm1.
One way to do this is using lubridate functions interval and %within%. They're pretty clearly named; interval creates a time period and %within% checks whether a supplied time point is within that interval.
Assuming that df1...df3 are actual data frames and not lists of dataframes as they appear to be in the question, we first add an interval variable to df1, which is our reference interval. We also need to transform the start dates of df2 and df3 into date objects with ymd:
library(lubridate)
library(dplyr)
df1 <- df1 %>%
  mutate(interval = interval(start = start, end = end))
df2 <- df2 %>%
  mutate(start = ymd(start))
df3 <- df3 %>%
  mutate(start = ymd(start))
Then it could be as simple as looking for start times from df2 and df3 that are within df1$interval:
df1$event[which(df2$start %within% df1$interval & df3$start %within% df1$interval)]
# [1] 1 4
This assumes that there is a constant number of events across each thermostat (i.e., consistent with your example data), but I don't think that's what you really want. I think a more robust approach would be to check whether a particular interval has start dates within it from both df2 and df3, e.g.,
df1 %>%
  rowwise() %>%
  mutate(tripleWhammy =
           any(df2$start %within% interval) &
           any(df3$start %within% interval))
# A tibble: 4 x 6
# Rowwise:
#   start     end       event site  interval                       tripleWhammy
#   <chr>     <chr>     <dbl> <chr> <Interval>                     <lgl>
# 1 2002-04-… 2002-04-…     1 Ther… 2002-04-12 UTC--2002-04-21 UTC TRUE
# 2 2002-06-… 2002-07-…     2 Ther… 2002-06-26 UTC--2002-07-05 UTC FALSE
# 3 2002-08-… 2002-08-…     3 Ther… 2002-08-15 UTC--2002-08-20 UTC FALSE
# 4 2005-08-… 2005-08-…     4 Ther… 2005-08-08 UTC--2005-08-19 UTC TRUE
Data:
df1 <- data.frame(
  start = c('2002-04-12', '2002-06-26', '2002-08-15', '2005-08-08'),
  end = c('2002-04-21', '2002-07-05', '2002-08-20', '2005-08-19'),
  event = c(1,2,3,4),
  site = 'Therm1')
df2 <- data.frame(
  start = c('2002-04-13', '2002-08-11', '2005-06-09', '2005-08-10'),
  end = c('2002-04-19', '2002-08-19', '2005-06-14', '2005-08-14'),
  event = c(1,2,3,4),
  site = 'Therm2')
df3 <- data.frame(
  start = c('2002-04-14', '2002-08-11', '2005-06-09', '2005-08-10'),
  end = c('2002-04-19', '2002-08-19', '2005-06-14', '2005-08-14'),
  event = c(1,2,3,4),
  site = 'Therm3')
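Since the question asks how often the events co-occur, a variant of the rowwise approach (a sketch under the same assumptions) counts the matching starts instead of only flagging them:
df1 %>%
  rowwise() %>%
  mutate(n2 = sum(df2$start %within% interval),  # Therm2 starts inside the interval
         n3 = sum(df3$start %within% interval),  # Therm3 starts inside the interval
         tripleWhammy = n2 > 0 & n3 > 0) %>%
  ungroup()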

parse dates from multiple columns with NAs and dates hidden in text

I have a data.frame with dates distributed across columns and in a messy format: the year column contains years and NAs, the column date_old contains the format Month DD or DD (or a date duration) or NAs, and the column hidden_date contains text and dates either in the format .... YYYY .... or in the format .... DD Month YYYY .... (with .... representing general text of variable length).
An example data.frame looks like this:
df <- data.frame(year = c("1992", "1993", "1995", NA),
                 date_old = c("February 15", "October 02-24", "15", NA),
                 hidden_date = c(NA, NA, "The hidden date is 15 July 1995", "The hidden date is 2005"))
I want to get the dates in the format YYYY-MM-DD (take the first day of date durations) and fill unknown values with zeroes.
Using parse_date_time didn't help me so far, and the expected output would be:
year date_old hidden_date date
1 1992 February 15 <NA> 1992-02-15
2 1993 October 02-24 <NA> 1993-10-02
3 1995 15 The hidden date is 15 July 1995 1995-07-15
4 <NA> <NA> The hidden date is 2005 2005-00-00
How do I best go about this?
It's a little complicated because you have a jumble of date information in different columns which you need to extract and combine. I don't quite understand whether you only have three columns or whether there could be more, so I've tried to solve the general case of an arbitrary number of columns. If you only have three columns, each of which always has the same format, then things could be a little simpler, but not much.
I would start by creating a regex pattern for month names:
# We'll use dplyr, stringr, tidyr, readr, and purrr
library(tidyverse)
# We'll use month names and abbreviations just in case.
ms <- paste(c(month.name, month.abb), collapse = "|")
# [1] "January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
We can then iterate over each column, extracting the year, month, and day from each row as a data frame, which we then combine into a single data frame. The digit suffixes correspond to the original columns:
df_split_ymd <- map_dfc(df,
                        ~ map_dfr(
                          .,
                          ~ tibble(
                            year  = str_extract(., "\\b\\d{4}\\b"),
                            month = str_extract(., str_glue("\\b({ms})\\b")),
                            day   = str_extract(., "\\b\\d{2}\\b")
                          )
                        )
)
#### OUTPUT ####
# A tibble: 4 x 9
year month day year1 month1 day1 year2 month2 day2
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1992 NA NA NA February 15 NA NA NA
2 1993 NA NA NA October 02 NA NA NA
3 1995 NA NA NA NA 15 1995 July 15
4 NA NA NA NA NA NA 2005 NA NA
Finally, the year*, month*, and day* columns should be coalesced and then united to make parsing easier. Note that I've replaced NA values in day with "01" and those in month with "January", because dates can't contain "00":
df_ymd <- df_split_ymd %>%
  mutate(year = coalesce(!!!as.list(select(., starts_with("year")))),
         month = coalesce(!!!as.list(select(., starts_with("month")))) %>%
           replace_na("January"),
         day = coalesce(!!!as.list(select(., starts_with("day")))) %>%
           replace_na("01")
  ) %>%
  unite(ymd, year, month, day, sep = " ") %>%
  select(ymd) %>%
  mutate(ymd = parse_date(ymd, "%Y %B %d"))
#### OUTPUT ####
# A tibble: 4 x 1
ymd
<date>
1 1992-02-15
2 1993-10-02
3 1995-07-15
4 2005-01-01
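If you then want the parsed dates alongside the original columns, a one-line sketch (df_final is a hypothetical name):
df_final <- bind_cols(df, df_ymd)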

Rolling 7 Day Sum grouped by date and unique ID

I am using workload data to compute three metrics: Daily (sum for that date), 7-Day rolling (sum of the last 7 days), and 28-Day rolling average (sum of the last 28 days divided by 4).
I have been able to compute Daily, but I am having trouble with the 7-Day rolling and 28-Day rolling average. I have 17 unique IDs for each date (dates range from 2018-08-09 to 2018-12-15).
library(dplyr)
library(tidyr)
library(tidyverse)
library(zoo)
Post_Practice <- read.csv("post.csv", stringsAsFactors = FALSE)
Post_Data <- Post_Practice[, 1:3]
DailyLoad <- Post_Data %>%
  group_by(Date, Name) %>%
  transmute(Daily = sum(DayLoad)) %>%
  distinct(Date, Name, .keep_all = TRUE) %>%
  mutate('7-day' = rollapply(Daily, 7, sum, na.rm = TRUE, partial = TRUE))
Input:
Date Name DayLoad
2018-08-09 Athlete 1 273.92000
2018-08-09 Athlete 2 351.16000
2018-08-09 Athlete 3 307.97000
2018-08-09 Athlete 1 434.20000
2018-08-09 Athlete 2 605.92000
2018-08-09 Athlete 3 432.87000
Input looks like this all the way to 2018-12-15. Some dates have multiples of data (like above) and some only have one entry.
This code produces the 7-day column, but it shows the same number as Daily, i.e.:
Date Name Daily 7-day
<chr> <chr> <dbl> <dbl>
1 2018-08-09 Athlete 1 708. 708.
2 2018-08-09 Athlete 2 957. 957.
3 2018-08-09 Athlete 3 741. 741.
The goal is to have the final table (i.e. 7 days later) look like this:
Date Name Daily 7-day
<chr> <chr> <dbl> <dbl>
1 2018-08-15 Athlete 1 413. 3693.
2 2018-08-15 Athlete 2 502. 4348.
3 2018-08-15 Athlete 3 490. 4007.
Where Daily is the sum for that specific date and 7-day is the sum of the last 7 dates for that specific unique ID.
The help file for rollsum says:
The default methods of rollmean and rollsum do not handle inputs that
contain NAs.
Use rollapplyr(x, width, sum, na.rm = TRUE) to exclude NAs in the input from the sum. Note the r at the end of rollapplyr to specify right alignment.
Also note the partial=TRUE argument can be used if you want partial sums at the beginning rather than NAs.
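Putting those pieces together: the 7-day column in the question repeats Daily because rollapply runs inside Date + Name groups of one row each. A hedged sketch of the full pipeline, assuming one row per Name and Date after summarising and no gaps in the dates (rollapplyr counts rows, not calendar days):
library(dplyr)
library(zoo)
DailyLoad <- Post_Data %>%
  group_by(Date, Name) %>%
  summarise(Daily = sum(DayLoad), .groups = "drop") %>%   # one row per athlete-date
  arrange(Name, Date) %>%
  group_by(Name) %>%                                      # roll within each athlete
  mutate(`7-day`  = rollapplyr(Daily, 7, sum, na.rm = TRUE, partial = TRUE),
         `28-day` = rollapplyr(Daily, 28, sum, na.rm = TRUE, partial = TRUE) / 4) %>%
  ungroup()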

Trying to cast rows to columns in r

I am trying to do something very simple in R: transpose a data set so I can create a primary key for joining with other tables that have many values.
I've tried dcast and aggregate, and haven't gotten them to work.
Here's what my dataframe currently looks like:
[screenshot: current R dataframe]
Here's what I would like it to look like:
[screenshot: new R dataframe]
You can insert code in your post, so paste the code that creates your data.frame, like this:
df <- data.frame(
  Make = c('Ford', 'Ford', 'Ford', 'Chevy', 'Chrysler', 'Chrysler'),
  DateSold = c('2017-07-01', '2017-08-01', '2017-10-01', '2017-01-01', '2017-03-01', '2017-04-01'),
  Amount = c(30, 15, 25, 23, 22, 21) * 1e3
)
Now for your question: you can use the tidyverse, which has a lot of useful functions for manipulating data. You can execute the following code line by line to understand the different steps leading to the solution.
library(tidyverse)
df %>%
  gather(-Make, key = Column, value = Value) %>%
  group_by(Make, Column) %>%
  mutate(Count = 1:n()) %>%
  unite(Column_count, Column, Count) %>%
  spread(Column_count, Value)
# Make Amount_1 Amount_2 Amount_3 DateSold_1 DateSold_2 DateSold_3
# <fct> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Chevy 23000 NA NA 2017-01-01 NA NA
# 2 Chrysler 22000 21000 NA 2017-03-01 2017-04-01 NA
# 3 Ford 30000 15000 25000 2017-07-01 2017-08-01 2017-10-01
Using reshape, you can do something like:
reshape(transform(df, time = ave(Amount, Make, FUN = seq_along)), dir = 'wide', idvar = 'Make')
Make DateSold.1 Amount.1 DateSold.2 Amount.2 DateSold.3 Amount.3
1 Ford 2017-07-01 30000 2017-08-01 15000 2017-10-01 25000
4 Chevy 2017-01-01 23000 <NA> NA <NA> NA
5 Chrysler 2017-03-01 22000 2017-04-01 21000 <NA> NA
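With more recent tidyr (1.0+), pivot_wider expresses the same reshape; a sketch:
library(tidyverse)
df %>%
  group_by(Make) %>%
  mutate(Count = row_number()) %>%  # occurrence index per Make
  ungroup() %>%
  pivot_wider(names_from = Count, values_from = c(DateSold, Amount))
This yields one row per Make with columns DateSold_1 ... Amount_3, equivalent to the gather/spread result (column order may differ).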
