Duplicate rows in R based on content of columns

I'm working with a school schedule dataset for a visualization project and had days of classes originally in the form "MW" or "TTh" etc - they are now in the format below:
name start end first second
finance 9:00 10:00 M W
stats 10:30 11:30 T Th
econ 16:30 19:00 T NA
I'm looking to duplicate the first three columns to get a dataframe that looks like:
day name start end
M finance 9:00 10:00
W finance 9:00 10:00
T stats 10:30 11:30
Th stats 10:30 11:30
T econ 16:30 19:00
Any ideas?

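We can use pivot_longer from tidyr to stack the first and second columns into a single day column, then drop the rows where day is NA.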
library(dplyr)
library(tidyr)
pivot_longer(df1, cols = c(first, second), values_to = 'day',
names_to = 'name1') %>%
select(day, name, start, end) %>%
filter(complete.cases(day))
-output
# A tibble: 5 x 4
# day name start end
# <chr> <chr> <chr> <chr>
#1 M finance 9:00 10:00
#2 W finance 9:00 10:00
#3 T stats 10:30 11:30
#4 Th stats 10:30 11:30
#5 T econ 16:30 19:00
data
df1 <- structure(list(name = c("finance", "stats", "econ"), start = c("9:00",
"10:30", "16:30"), end = c("10:00", "11:30", "19:00"), first = c("M",
"T", "T"), second = c("W", "Th", NA)), class = "data.frame", row.names = c(NA,
-3L))
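For comparison, the same reshaping can be sketched in base R without any packages. This is only an illustrative alternative, not the answer's method; it rebuilds df1 from the data above, and the names days and out are made up for the example.

```r
# Same example data as df1 above
df1 <- data.frame(
  name   = c("finance", "stats", "econ"),
  start  = c("9:00", "10:30", "16:30"),
  end    = c("10:00", "11:30", "19:00"),
  first  = c("M", "T", "T"),
  second = c("W", "Th", NA),
  stringsAsFactors = FALSE
)

# Interleave first/second row by row: M, W, T, Th, T, NA
days <- c(t(df1[c("first", "second")]))

# Repeat every row twice so it lines up with the interleaved days
out <- data.frame(
  day = days,
  df1[rep(seq_len(nrow(df1)), each = 2), c("name", "start", "end")],
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Drop the rows that had no second day and renumber
out <- out[!is.na(out$day), ]
rownames(out) <- NULL
print(out)
```

The transpose trick relies on c() reading a matrix column by column, so each class contributes its first day and then its second day in order.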

Related

in R Replace values with following value

I have a data set that I'm cleaning. In the 2nd column, some rows contain a - and the value I need is in the row below. How do I replace the - with the value under it?
I have thousands of rows like this, and the value below differs from row to row, so I can't just do
df$agent[df$agent == "-"] <- "john"
It would have to be done over 1,000 times. I'm looking for a way to do this much more efficiently.
1 Field Support - 6:00 AM - 6:59 AM 1/1/2020 9
3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
You can use case_when and lead from the dplyr package:
library(dplyr)
data %>%
mutate(Name = case_when(Name == "-" ~ lead(Name),
TRUE ~ Name))
Role Name Time Date Value
1 Field Support John 6:00 AM - 6:59 AM 1/1/2020 9
2 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
3 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
Data
data <- structure(list(Role = structure(c(1L, 1L, 1L), .Label = "Field Support", class = "factor"),
Name = structure(c(1L, 2L, 2L), .Label = c("-", "John"), class = "factor"),
Time = structure(1:3, .Label = c("6:00 AM - 6:59 AM", "7:00 AM - 7:59 AM",
"8:00 AM - 8:59 AM"), class = "factor"), Date = structure(c(1L,
1L, 1L), .Label = "1/1/2020", class = "factor"), Value = c(9L,
4L, 4L)), class = "data.frame", row.names = c(NA, -3L))
Here is a solution with base R and package tidyr.
library(tidyr)
col_num <- 3
is.na(df[[col_num]]) <- df[[col_num]] == '-'
fill(df, all_of(col_num), .direction = "up")
# V1 V2 V3 V4 V5 V6
#1 1 Field Support John 6:00 AM - 6:59 AM 1/1/2020 9
#2 3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
#3 4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
Data
df <- read.table(text = "
1 'Field Support' - '6:00 AM - 6:59 AM' 1/1/2020 9
3 'Field Support' John '7:00 AM - 7:59 AM' 1/1/2020 4
4 'Field Support' John '8:00 AM - 8:59 AM' 1/1/2020 4
")
Here's a solution without using any packages:
> df <- data.frame("ID" = c(1, 3, 4, 5, 8),
"Job" = rep("Field Support", 5),
"Agent" = c("-", rep("John", 2), "-", "Mary"),
"Hours" = c("6:00 AM - 6:59 AM",
"7:00 AM - 7:59 AM",
"8:00 AM - 8:59 AM",
"9:00 AM - 9:59 AM",
"10:00 AM - 10:59 AM"),
"Date" = rep("1/1/2020", 5),
"Metric" = c(9, 4, 4, 6, 2))
> print(df)
ID Job Agent Hours Date Metric
1 1 Field Support - 6:00 AM - 6:59 AM 1/1/2020 9
2 3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
3 4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
4 5 Field Support - 9:00 AM - 9:59 AM 1/1/2020 6
5 8 Field Support Mary 10:00 AM - 10:59 AM 1/1/2020 2
> df$Agent[which(df$Agent == "-")] <- df$Agent[which(df$Agent == "-") + 1]
> print(df)
ID Job Agent Hours Date Metric
1 1 Field Support John 6:00 AM - 6:59 AM 1/1/2020 9
2 3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
3 4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
4 5 Field Support Mary 9:00 AM - 9:59 AM 1/1/2020 6
5 8 Field Support Mary 10:00 AM - 10:59 AM 1/1/2020 2
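Note that the indexing approach above takes the value exactly one row down, so it assumes a "-" is never followed by another "-" and never sits in the last row. A minimal base R sketch of a backward fill that relaxes the first assumption follows; fill_up and its marker argument are hypothetical names, not from any package.

```r
# Replace each marker with the next non-marker value below it.
# A marker in the last position has nothing below it and stays NA.
fill_up <- function(x, marker = "-") {
  x <- as.character(x)
  x[!is.na(x) & x == marker] <- NA
  # Walk from the second-to-last element upwards
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

agent <- c("-", "-", "John", "John", "-", "Mary")
print(fill_up(agent))
```

With a data frame this would be used as df$Agent <- fill_up(df$Agent). It should behave like tidyr's fill(.direction = "up") for a single column, at the cost of an explicit loop.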

Groupby a column and find its sum and count

Background:
I have a dataset, df,
Date Duration
1/2/2020 5:00:00 PM 20
1/2/2020 5:30:01 PM 30
1/2/2020 6:00:00 PM 10
1/5/2020 7:00:01 AM 5
1/6/2020 8:00:00 AM 2
1/6/2020 9:00:00 AM 8
Desired Output:
Date Total_Duration Count
1/2/2020 60 3
1/5/2020 5 1
1/6/2020 10 2
Dput:
structure(list(Date = structure(1:6, .Label = c("1/2/2020 5:00:00 PM",
"1/2/2020 5:30:01 PM", "1/2/2020 6:00:00 PM", "1/5/2020 7:00:01 AM",
"1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"), class = "factor"),
Duration = c(20L, 30L, 10L, 5L, 2L, 8L)), class = "data.frame", row.names = c(NA,
-6L))
What I have tried:
library(dplyr)
df %>% group_by(Date) %>% add_tally() %>%
summarize(Duration)
Any guidance will be helpful.
We can extract the date part of 'Date' after converting it to a date-time with dmy_hms (assuming the format is DD/MM/YYYY HH:MM:SS; if the dates are month-first, mdy_hms is the equivalent parser), use that as the grouping variable, and get the sum of 'Duration' as Total_Duration and the row count from n() as Count.
library(dplyr)
library(lubridate)
df %>%
group_by(Date = as.Date(dmy_hms(Date))) %>%
summarise(Total_Duration = sum(Duration), Count = n())
# A tibble: 3 x 3
# Date Total_Duration Count
# <date> <int> <int>
#1 2020-02-01 60 3
#2 2020-05-01 5 1
#3 2020-06-01 10 2
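If the dates are actually month-first (as the desired output in the question suggests), a base R sketch of the same aggregation could look like the following. The column name Day and the objects total, count and out are illustrative only.

```r
df <- data.frame(
  Date = c("1/2/2020 5:00:00 PM", "1/2/2020 5:30:01 PM", "1/2/2020 6:00:00 PM",
           "1/5/2020 7:00:01 AM", "1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"),
  Duration = c(20L, 30L, 10L, 5L, 2L, 8L),
  stringsAsFactors = FALSE
)

# Assumption: month/day/year. as.Date() ignores the trailing time text.
df$Day <- as.Date(df$Date, format = "%m/%d/%Y")

# One sum and one count per calendar day
total <- tapply(df$Duration, df$Day, sum)
count <- tapply(df$Duration, df$Day, length)

out <- data.frame(
  Date = as.Date(names(total)),
  Total_Duration = as.vector(total),
  Count = as.vector(count)
)
print(out)
```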

Convert from military time to UTC in R

I have a dataset, df1, I would like to convert all the values from the 24 hour clock to UTC.
Date Name
1/2/2020 16:46 A
1/2/2020 16:51 B
I would like
Date Name
1/2/2020 4:46:47 PM A
1/2/2020 4:51:44 PM B
I have tried:
df$Date<- format(df$Date, "%m/%d/%Y %I:%M:%S %p")
dput:
structure(list(Date = structure(1:2, .Label = c("1/2/2020 16:46",
"1/2/2020 16:51"), class = "factor"), Name = structure(1:2, .Label = c("A",
"B"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
You can first convert the data to POSIXct format and then use format to get data in the required format.
df$Date <- format(as.POSIXct(df$Date, format = "%m/%d/%Y %H:%M"),
"%m/%d/%Y %I:%M:%S %p")
#Can also use mdy_hm from lubridate
#df$Date <- format(lubridate::mdy_hm(df$Date), "%m/%d/%Y %I:%M:%S %p")
df
# Date Name
#1 01/02/2020 04:46:00 PM A
#2 01/02/2020 04:51:00 PM B
Assuming you want to actually convert a string in one format to a string in another format rather than having it as a (more useful) actual date/time, you can use a little arithmetic and string chopping along with mapply:
splits <- strsplit(as.character(df$Date), " |:")
Hours <- as.numeric(sapply(splits, `[`, 2))
AMPM <- c(" AM", " PM")[Hours %/% 12 + 1]
# map 0 -> 12 and 13..23 -> 1..11; hours 1..12 are unchanged
Hours <- (Hours + 11) %% 12 + 1
df$Date <- mapply(function(x, y, z) paste0(x[1], " ", y, ":", x[3], z), splits, Hours, AMPM)
df
#> Date Name
#> 1 1/2/2020 4:46 PM A
#> 2 1/2/2020 4:51 PM B
Created on 2020-02-26 by the reprex package (v0.3.0)
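The hour arithmetic is the part that is easiest to get wrong at midnight and noon, so here is a small standalone sketch of one common mapping from 24-hour to 12-hour values. The helper names to_12h and am_pm are made up for this example.

```r
# 0 -> 12, 1..12 -> 1..12, 13..23 -> 1..11
to_12h <- function(h) (h + 11) %% 12 + 1

# Hours 0..11 are AM, 12..23 are PM
am_pm <- function(h) ifelse(h < 12, "AM", "PM")

h <- c(0, 1, 11, 12, 13, 16, 23)
print(paste(to_12h(h), am_pm(h)))
# "12 AM" "1 AM" "11 AM" "12 PM" "1 PM" "4 PM" "11 PM"
```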
Under the same assumptions as the previous answer by Allan, here is another way of converting from the 24-hour to the 12-hour clock.
library(tidyverse)
library(lubridate)
df <- tibble(
date = c(ymd_hms("2020/01/02 16:46:00", "2020/01/02 16:51:00", tz = "UTC")),
name = c("A", "B")
)
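df <- df %>%
mutate(date_hour = hour(date),
# hours 12-23 are PM (noon included)
am_pm = if_else(date_hour >= 12, "PM", "AM"),
# map 0 -> 12 and 13..23 -> 1..11; hours 1..12 are unchanged
date_hour = (date_hour + 11) %% 12 + 1,
# zero-pad minutes so 16:05 prints as 4:05 rather than 4:5
newdatetime = paste0(date(date), " ", date_hour, ":", sprintf("%02d", minute(date)), " ", am_pm)) %>%
select(-c(date_hour, am_pm))
df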
# A tibble: 2 x 3
date name newdatetime
<dttm> <chr> <chr>
1 2020-01-02 16:46:00 A 2020-01-02 4:46 PM
2 2020-01-02 16:51:00 B 2020-01-02 4:51 PM
Hope this helps!

segregation datetime on hourly basis in R

I'm using below-mentioned dataframe in R:
ID Datetime Value
T-1 2020-01-01 15:12:14 10
T-2 2020-01-01 00:12:10 20
T-3 2020-01-01 03:11:11 25
T-4 2020-01-01 14:01:01 20
T-5 2020-01-01 18:07:11 10
T-6 2020-01-01 20:10:09 15
T-7 2020-01-01 15:45:23 15
Using this dataframe, I want to bin Datetime by hour. I'm using the following code:
library(tidyverse)
DF$bins <- cut(lubridate::hour(DF$Datetime), c(-1, 0:24 - 0.01))
levels(DF$bins) <- c("00:00 to 00:59", "00:01 to 01:59", "00:02 to 02:59", "00:03 to 03:59", "00:04 to 04:59", "00:05 to 05:59",
"00:06 to 06:59", "00:07 to 07:59", "00:08 to 08:59", "00:09 to 09:59", "00:10 to 10:59", "00:11 to 11:59",
"00:12 to 12:59", "00:13 to 13:59", "00:14 to 14:59", "00:15 to 15:59", "00:16 to 16:59", "00:17 to 17:59",
"00:18 to 18:59", "00:19 to 19:59", "00:20 to 20:59", "00:21 to 21:59", "00:22 to 22:59", "00:23 to 23:59")
newDF <- DF %>%
dplyr::group_by(bins, .drop = FALSE) %>%
dplyr::summarise(Count = length(Value), Total = sum(Value))
Final<-newDF %>%
dplyr::summarise(bins = "January", Count = sum(Count), Total = sum(Total)) %>% bind_rows(newDF)
Final[,c(2,3)]<-sapply(Final[,c(2,3)], function(x) scales::comma(x))
At levels(DF$bins) <- I'm getting the error:
Error in `levels<-.factor`(`*tmp*`, value = c("00:00 to 00:59", "00:01 to 01:59", :
number of levels differs
How can I keep the bins below fixed and aggregate the numbers accordingly?
"00:00 to 00:59", "00:01 to 01:59", "00:02 to 02:59", "00:03 to 03:59", "00:04 to 04:59", "00:05 to 05:59", "00:06 to 06:59", "00:07 to 07:59", "00:08 to 08:59", "00:09 to 09:59", "00:10 to 10:59", "00:11 to 11:59","00:12 to 12:59", "00:13 to 13:59", "00:14 to 14:59", "00:15 to 15:59", "00:16 to 16:59", "00:17 to 17:59","00:18 to 18:59", "00:19 to 19:59", "00:20 to 20:59", "00:21 to 21:59", "00:22 to 22:59", "00:23 to 23:59"
Expected Output:
Month Count Sum
Jan-20 7 115
12:00 AM to 05:00 AM 2 45
06:00 AM to 12:00 PM 0 0
12:00 PM to 03:00 PM 1 20
03:00 PM to 08:00 PM 3 35
08:00 PM to 12:00 AM 1 15
We can use floor_date from lubridate to truncate each timestamp to the hour, build a label column (bins) from that hour with sprintf, and then use this column to calculate whatever we want for each group.
library(dplyr)
library(lubridate)
df %>%
mutate(bins = floor_date(Datetime, "hour"),
hour = hour(bins),
bins = paste0(sprintf("%02d:00 :", hour), sprintf(" %02d:59", hour))) %>%
group_by(bins) %>%
summarise(sum = sum(Value))
# A tibble: 6 x 2
# bins sum
# <chr> <int>
#1 00:00 : 00:59 20
#2 03:00 : 03:59 25
#3 14:00 : 14:59 20
#4 15:00 : 15:59 25
#5 18:00 : 18:59 10
#6 20:00 : 20:59 15
For the updated condition, we can do
df %>%
mutate(hour = hour(Datetime),
gr = case_when(hour >= 0 & hour < 6 ~ "12:00 AM to 06:00 AM",
hour >= 6 & hour < 12 ~ "06:00 AM to 12:00 PM",
hour >= 12 & hour < 15 ~ "12:00 PM to 03:00 PM",
hour >= 15 & hour < 20 ~ "03:00 PM to 08:00 PM",
TRUE ~ "08:00 PM to 12:00 AM"),
month_year = format(Datetime, "%Y-%m"),
bins = factor(gr, levels = c("12:00 AM to 06:00 AM", "06:00 AM to 12:00 PM",
"12:00 PM to 03:00 PM", "03:00 PM to 08:00 PM",
"08:00 PM to 12:00 AM"))) %>%
group_by(month_year, bins, .drop = FALSE) %>%
summarise(Count = n(), Sum = sum(Value))
# month_year bins Count Sum
# <chr> <fct> <int> <int>
#1 2020-01 12:00 AM to 06:00 AM 2 45
#2 2020-01 06:00 AM to 12:00 PM 0 0
#3 2020-01 12:00 PM to 03:00 PM 1 20
#4 2020-01 03:00 PM to 08:00 PM 3 35
#5 2020-01 08:00 PM to 12:00 AM 1 15
data
df <- structure(list(ID = structure(1:7, .Label = c("T-1", "T-2", "T-3",
"T-4", "T-5", "T-6", "T-7"), class = "factor"), Datetime = structure(c(1577891534,
1577837530, 1577848271, 1577887261, 1577902031, 1577909409, 1577893523
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Value = c(10L,
20L, 25L, 20L, 10L, 15L, 15L)), row.names = c(NA, -7L), class = "data.frame")
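As a side note on the error in the question: cut() produces one level per interval, so n labels need n + 1 breaks. The vector c(-1, 0:24 - 0.01) has 26 breaks, hence 25 levels, which is why assigning 24 labels fails. A base R sketch of the custom bins with matching breaks and labels follows; the object names dt, hours, bins and out are illustrative.

```r
dt <- as.POSIXct(c("2020-01-01 15:12:14", "2020-01-01 00:12:10",
                   "2020-01-01 03:11:11", "2020-01-01 14:01:01",
                   "2020-01-01 18:07:11", "2020-01-01 20:10:09",
                   "2020-01-01 15:45:23"), tz = "UTC")
value <- c(10, 20, 25, 20, 10, 15, 15)

# Hour of day (0..23) without lubridate
hours <- as.POSIXlt(dt)$hour

# 6 breaks give 5 intervals; right = FALSE makes them [0,6), [6,12), ...
bins <- cut(hours,
            breaks = c(0, 6, 12, 15, 20, 24),
            right = FALSE,
            labels = c("12:00 AM to 06:00 AM", "06:00 AM to 12:00 PM",
                       "12:00 PM to 03:00 PM", "03:00 PM to 08:00 PM",
                       "08:00 PM to 12:00 AM"))

# table() keeps empty levels; tapply() returns NA for them
total <- as.vector(tapply(value, bins, sum))
total[is.na(total)] <- 0

out <- data.frame(bins = levels(bins),
                  Count = as.vector(table(bins)),
                  Sum = total)
print(out)
```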

Converting Date/Time in R

I am struggling hard with date time formatting in R. I am sure this is an easy fix... can someone write me a line of code that will convert all values from Year, M, D, Time into a new column "datetime"?
What data looks like:
x year m d time
A 2019 2 23 11:12 PM
B 2019 1 31 2:04 PM
C 2018 12 31 12:01 AM
D 2017 2 1 10:14 AM
What I want:
x datetime
A 2/23/19 11:12 PM
B 1/31/19 2:04 PM
C 12/31/18 12:01 AM
D 2/1/17 10:14 AM
Since it's a datetime value we can convert it into a standard format by pasting the values together.
df$datetime <- with(df, as.POSIXct(paste(year, m, d, time),
format = "%Y %m %d %I:%M %p", tz = "UTC"))
df
# x year m d time datetime
#1 A 2019 2 23 11:12PM 2019-02-23 23:12:00
#2 B 2019 1 31 2:04PM 2019-01-31 14:04:00
#3 C 2018 12 31 12:01AM 2018-12-31 00:01:00
#4 D 2017 2 1 10:14AM 2017-02-01 10:14:00
Or using lubridate
library(dplyr)
library(lubridate)
df %>% mutate(datetime = ymd_hm(paste(year, m, d, time)))
data
df <- structure(list(x = structure(1:4, .Label = c("A", "B", "C", "D"
), class = "factor"), year = c(2019L, 2019L, 2018L, 2017L), m = c(2L,
1L, 12L, 2L), d = c(23L, 31L, 31L, 1L), time = c("11:12 PM",
"2:04 PM", "12:01 AM", "10:14 AM")), row.names = c(NA, -4L), class = "data.frame")
I think the below should work for your goal:
df <- data.frame(datetime = with(df, sprintf("%d/%d/%02d %s", m, d, year %% 100, time)))
If you want to append the new column to the existing data.frame df, then use:
df$datetime <- with(df, sprintf("%d/%d/%02d %s", m, d, year %% 100, time))
