I have a data set that I'm cleaning. Some rows have a - in the 2nd column, and the value in the row below is the one I need. How do I replace the - with the value under it?
I have thousands of rows like this, and the value below is a different name each time, so I can't just do
df$agent[df$agent == "-"] <- "john"
That would have to be done over 1,000 times. I'm looking for a much more efficient way to do this.
1 Field Support - 6:00 AM - 6:59 AM 1/1/2020 9
3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
You can use case_when and lead from the dplyr package:
library(dplyr)
data %>%
mutate(Name = case_when(Name == "-" ~ lead(Name),
TRUE ~ Name))
Role Name Time Date Value
1 Field Support John 6:00 AM - 6:59 AM 1/1/2020 9
2 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
3 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
Data
data <- structure(list(Role = structure(c(1L, 1L, 1L), .Label = "Field Support", class = "factor"),
Name = structure(c(1L, 2L, 2L), .Label = c("-", "John"), class = "factor"),
Time = structure(1:3, .Label = c("6:00 AM - 6:59 AM", "7:00 AM - 7:59 AM",
"8:00 AM - 8:59 AM"), class = "factor"), Date = structure(c(1L,
1L, 1L), .Label = "1/1/2020", class = "factor"), Value = c(9L,
4L, 4L)), class = "data.frame", row.names = c(NA, -3L))
Here is a solution with base R and package tidyr.
library(tidyr)
col_num <- 3
is.na(df[[col_num]]) <- df[[col_num]] == '-'
fill(df, all_of(col_num), .direction = "up")
# V1 V2 V3 V4 V5 V6
#1 1 Field Support John 6:00 AM - 6:59 AM 1/1/2020 9
#2 3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
#3 4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
Data
df <- read.table(text = "
1 'Field Support' - '6:00 AM - 6:59 AM' 1/1/2020 9
3 'Field Support' John '7:00 AM - 7:59 AM' 1/1/2020 4
4 'Field Support' John '8:00 AM - 8:59 AM' 1/1/2020 4
")
Here's a solution without using any packages:
> df <- data.frame("ID" = c(1, 3, 4, 5, 8),
"Job" = rep("Field Support", 5),
"Agent" = c("-", rep("John", 2), "-", "Mary"),
"Hours" = c("6:00 AM - 6:59 AM",
"7:00 AM - 7:59 AM",
"8:00 AM - 8:59 AM",
"9:00 AM - 9:59 AM",
"10:00 AM - 10:59 AM"),
"Date" = rep("1/1/2020", 5),
"Metric" = c(9, 4, 4, 6, 2))
> print(df)
ID Job Agent Hours Date Metric
1 1 Field Support - 6:00 AM - 6:59 AM 1/1/2020 9
2 3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
3 4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
4 5 Field Support - 9:00 AM - 9:59 AM 1/1/2020 6
5 8 Field Support Mary 10:00 AM - 10:59 AM 1/1/2020 2
> df$Agent[which(df$Agent == "-")] <- df$Agent[which(df$Agent == "-") + 1]
> print(df)
ID Job Agent Hours Date Metric
1 1 Field Support John 6:00 AM - 6:59 AM 1/1/2020 9
2 3 Field Support John 7:00 AM - 7:59 AM 1/1/2020 4
3 4 Field Support John 8:00 AM - 8:59 AM 1/1/2020 4
4 5 Field Support Mary 9:00 AM - 9:59 AM 1/1/2020 6
5 8 Field Support Mary 10:00 AM - 10:59 AM 1/1/2020 2
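The one-liner above assumes every "-" is followed directly by a real name. A minimal sketch, assuming runs of consecutive "-" can occur and should all take the next name below them (the helper name fill_up is hypothetical and not from any package):

```r
# Hypothetical helper: replace each marker with the nearest non-marker value below it.
fill_up <- function(x, marker = "-") {
  x <- as.character(x)                 # also handles factor columns
  x[which(x == marker)] <- NA          # treat the marker as missing
  # Walk from the second-to-last element upwards so runs of NA are filled in turn.
  for (i in rev(seq_len(max(length(x) - 1L, 0L)))) {
    if (is.na(x[i])) x[i] <- x[i + 1L]
  }
  x
}

agent <- c("-", "John", "John", "-", "-", "Mary")
filled <- fill_up(agent)
print(filled)
```

It could be applied as df$Agent <- fill_up(df$Agent). A "-" in the very last row has nothing below it and is left as NA, and values that were already NA are filled the same way as markers.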
I'm working with a school schedule dataset for a visualization project and had days of classes originally in the form "MW" or "TTh" etc - they are now in the format below:
name start end first second
finance 9:00 10:00 M W
stats 10:30 11:30 T Th
econ 16:30 19:00 T NA
I'm looking to duplicate the first three columns to get a dataframe that looks like:
day name start end
M finance 9:00 10:00
W finance 9:00 10:00
T stats 10:30 11:30
Th stats 10:30 11:30
T econ 16:30 19:00
Any ideas?
We can use pivot_longer
library(dplyr)
library(tidyr)
pivot_longer(df1, cols = c(first, second), values_to = 'day',
names_to = 'name1') %>%
select(day, name, start, end) %>%
filter(complete.cases(day))
-output
# A tibble: 5 x 4
# day name start end
# <chr> <chr> <chr> <chr>
#1 M finance 9:00 10:00
#2 W finance 9:00 10:00
#3 T stats 10:30 11:30
#4 Th stats 10:30 11:30
#5 T econ 16:30 19:00
data
df1 <- structure(list(name = c("finance", "stats", "econ"), start = c("9:00",
"10:30", "16:30"), end = c("10:00", "11:30", "19:00"), first = c("M",
"T", "T"), second = c("W", "Th", NA)), class = "data.frame", row.names = c(NA,
-3L))
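For comparison, the same reshaping can be sketched without tidyr. This is only an illustration under the assumption that the day columns are exactly first and second; the helper stack_day and the temporary row column are hypothetical names introduced here:

```r
df1 <- data.frame(
  name   = c("finance", "stats", "econ"),
  start  = c("9:00", "10:30", "16:30"),
  end    = c("10:00", "11:30", "19:00"),
  first  = c("M", "T", "T"),
  second = c("W", "Th", NA),
  stringsAsFactors = FALSE
)

keep <- c("name", "start", "end")

# Hypothetical helper: one copy of the kept columns per day column,
# tagged with the original row number so the order can be restored.
stack_day <- function(day_col) {
  data.frame(row = seq_len(nrow(df1)), day = df1[[day_col]], df1[keep],
             stringsAsFactors = FALSE)
}

long <- rbind(stack_day("first"), stack_day("second"))
long <- long[!is.na(long$day), ]               # drop classes with no second day
long <- long[order(long$row), c("day", keep)]  # back to the original class order
rownames(long) <- NULL
print(long)
```

Sorting on the original row number keeps each class's days together, matching the row order of the pivot_longer result.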
So I have a dataframe as such
ID Date TIME var Data misc
1 1/3/2018 3:30 AM a string1 string1
1 4/23/2019 1:32 PM b string2 string1
1 1/3/2018 4:53 PM c string3 string1
2 1/4/2018 3:32 AM d string4 string2
2 3/3/2018 3:30 PM s string5 string2
2 3/3/2018 3:30 PM e string6 string2
3 4/23/2019 6:24 AM w
3 4/23/2019 1:32 PM s
3 4/24/2019 3:20 PM s
3 4/24/2019 3:20 PM a
There are a number of columns similar to Data and misc that I would like to fill in by joining another df that contains only the ID = 3 data.
ID3_data
DATE Time Data misc
4/23/2019 6:24 AM string7 stringA
4/23/2019 1:32 PM string8 stringB
4/24/2019 3:20 PM string9 stringC
4/24/2019 3:20 PM string10 stringC
So how could I left join my DF with this ID3_data for just the rows where ID = 3?
Additionally, the only identifiers I have are Date and TIME, and some rows share the same identifiers. Is there a way to say that the first instance matches the first and the second matches the second? In short, the final DF should look like this:
ID Date TIME var Data misc
1 1/3/2018 3:30 AM a string1 string1
1 4/23/2019 1:32 PM b string2 string1
1 1/3/2018 4:53 PM c string3 string1
2 1/4/2018 3:32 AM d string4 string2
2 3/3/2018 3:30 PM s string5 string2
2 3/3/2018 3:30 PM e string6 string2
3 4/23/2019 6:24 AM w string7 stringA
3 4/23/2019 1:32 PM s string8 stringB
3 4/24/2019 3:20 PM s string9 stringC
3 4/24/2019 3:20 PM a string10 stringC
Again, the priority is joining select rows, but if the duplicate issue could also be handled in the same step using dplyr, that would be great.
We could do a join followed by coalesce, assuming the missing values are NA:
library(dplyr)# 1.0.0
left_join(DF, ID3_data %>%
mutate(ID = 3), by = c('ID', 'Date' = 'DATE', 'TIME' = 'Time')) %>%
mutate(Data = coalesce(Data.x, Data.y), misc = coalesce(misc.x, misc.y))
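Or, if there are duplicates, an option is to bind the rows of the two datasets and then do a group_by/summarise that keeps only the non-NA values (dplyr 1.0.0 allows summarise to return more than one row per group; from dplyr 1.1.0 this is deprecated in favour of reframe)
cbind(ID = 3, ID3_data) %>%
setNames(names(DF)) %>%  # base R setNames; set_names would need purrr or magrittr attached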
bind_rows(DF) %>%
group_by(ID, Date, TIME) %>%
summarise(across(everything(), ~ .[!is.na(.)]))
# A tibble: 10 x 5
# Groups: ID, Date, TIME [8]
# ID Date TIME Data misc
# <dbl> <chr> <chr> <chr> <chr>
# 1 1 1/3/2018 3:30 AM string1 string1
# 2 1 1/3/2018 4:53 PM string3 string1
# 3 1 4/23/2019 1:32 PM string2 string1
# 4 2 1/4/2018 3:32 AM string4 string2
# 5 2 3/3/2018 3:30 PM string5 string2
# 6 2 3/3/2018 3:30 PM string6 string2
# 7 3 4/23/2019 1:32 PM string8 stringB
# 8 3 4/23/2019 6:24 AM string7 stringA
# 9 3 4/24/2019 3:20 PM string9 stringC
#10 3 4/24/2019 3:20 PM string10 stringC
data
DF <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
Date = c("1/3/2018", "4/23/2019", "1/3/2018", "1/4/2018",
"3/3/2018", "3/3/2018", "4/23/2019", "4/23/2019", "4/24/2019",
"4/24/2019"), TIME = c("3:30 AM", "1:32 PM", "4:53 PM", "3:32 AM",
"3:30 PM", "3:30 PM", "6:24 AM", "1:32 PM", "3:20 PM", "3:20 PM"
), Data = c("string1", "string2", "string3", "string4", "string5",
"string6", NA, NA, NA, NA), misc = c("string1", "string1",
"string1", "string2", "string2", "string2", NA, NA, NA, NA
)), class = "data.frame", row.names = c(NA, -10L))
ID3_data <- structure(list(DATE = c("4/23/2019", "4/23/2019", "4/24/2019",
"4/24/2019"), Time = c("6:24 AM", "1:32 PM", "3:20 PM", "3:20 PM"
), Data = c("string7", "string8", "string9", "string10"), misc = c("stringA",
"stringB", "stringC", "stringC")), class = "data.frame",
row.names = c(NA,
-4L))
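The "first instance goes to the first, second to the second" requirement can also be sketched by numbering repeated keys before matching. This is an illustration only, assuming both tables list their duplicates in the intended order; the occ columns and key strings are hypothetical names introduced here:

```r
DF <- data.frame(
  ID   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  Date = c("1/3/2018", "4/23/2019", "1/3/2018", "1/4/2018", "3/3/2018",
           "3/3/2018", "4/23/2019", "4/23/2019", "4/24/2019", "4/24/2019"),
  TIME = c("3:30 AM", "1:32 PM", "4:53 PM", "3:32 AM", "3:30 PM",
           "3:30 PM", "6:24 AM", "1:32 PM", "3:20 PM", "3:20 PM"),
  Data = c(paste0("string", 1:6), NA, NA, NA, NA),
  misc = c(rep("string1", 3), rep("string2", 3), NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
ID3_data <- data.frame(
  DATE = c("4/23/2019", "4/23/2019", "4/24/2019", "4/24/2019"),
  Time = c("6:24 AM", "1:32 PM", "3:20 PM", "3:20 PM"),
  Data = c("string7", "string8", "string9", "string10"),
  misc = c("stringA", "stringB", "stringC", "stringC"),
  stringsAsFactors = FALSE
)

# Number repeated keys 1, 2, ... in order of appearance within each table.
DF_occ  <- ave(seq_len(nrow(DF)), DF$ID, DF$Date, DF$TIME, FUN = seq_along)
new_occ <- ave(seq_len(nrow(ID3_data)), ID3_data$DATE, ID3_data$Time, FUN = seq_along)

# Build composite keys; the lookup table only ever describes ID 3.
key_DF  <- paste(DF$ID, DF$Date, DF$TIME, DF_occ, sep = "|")
key_new <- paste(3, ID3_data$DATE, ID3_data$Time, new_occ, sep = "|")
idx <- match(key_DF, key_new)

# Fill only cells that are missing in DF and have a matching lookup row.
for (col in c("Data", "misc")) {
  fill <- !is.na(idx) & is.na(DF[[col]])
  DF[[col]][fill] <- ID3_data[[col]][idx[fill]]
}
print(DF)
```

Because values are written into DF in place, the original row order is preserved and rows for other IDs are never touched.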
Background:
I have a dataset, df,
Date Duration
1/2/2020 5:00:00 PM 20
1/2/2020 5:30:01 PM 30
1/2/2020 6:00:00 PM 10
1/5/2020 7:00:01 AM 5
1/6/2020 8:00:00 AM 2
1/6/2020 9:00:00 AM 8
Desired Output:
Date Total_Duration Count
1/2/2020 60 3
1/5/2020 5 1
1/6/2020 10 2
Dput:
structure(list(Date = structure(1:6, .Label = c("1/2/2020 5:00:00 PM",
"1/2/2020 5:30:01 PM", "1/2/2020 6:00:00 PM", "1/5/2020 7:00:01 AM",
"1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"), class = "factor"),
Duration = c(20L, 30L, 10L, 5L, 2L, 8L)), class = "data.frame", row.names = c(NA,
-6L))
What I have tried:
library(dplyr)
df %>% group_by(Date) %>% add_tally() %>%
summarize(Duration)
Any guidance will be helpful.
We can extract the date part of 'Date' after converting it to a date-time with dmy_hms (assuming the format is DD/MM/YYYY HH:MM:SS), use that as the grouping variable, and get the sum of 'Duration' along with 'Count' as n()
library(dplyr)
library(lubridate)
df %>%
group_by(Date = as.Date(dmy_hms(Date))) %>%
summarise(Total_Duration = sum(Duration), Count = n())
# A tibble: 3 x 3
# Date Total_Duration Count
# <date> <int> <int>
#1 2020-02-01 60 3
#2 2020-05-01 5 1
#3 2020-06-01 10 2
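Whether 1/2/2020 means 1 February or 2 January changes the dates in the result, so it may help to make the format explicit. A base R sketch, assuming only the leading date part matters; summarise_by_day and its fmt argument are hypothetical names introduced here:

```r
df <- data.frame(
  Date = c("1/2/2020 5:00:00 PM", "1/2/2020 5:30:01 PM", "1/2/2020 6:00:00 PM",
           "1/5/2020 7:00:01 AM", "1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"),
  Duration = c(20L, 30L, 10L, 5L, 2L, 8L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: `fmt` states how the leading date part should be read.
summarise_by_day <- function(data, fmt) {
  # as.Date() ignores trailing characters, so the time part is dropped here.
  day <- as.Date(as.character(data$Date), format = fmt)
  data.frame(
    Date           = sort(unique(day)),
    Total_Duration = as.vector(tapply(data$Duration, day, sum)),
    Count          = as.vector(tapply(data$Duration, day, length))
  )
}

by_dmy <- summarise_by_day(df, "%d/%m/%Y")  # day first, as assumed in the answer above
by_mdy <- summarise_by_day(df, "%m/%d/%Y")  # month first
print(by_dmy)
print(by_mdy)
```

The totals and counts are the same under both readings for this sample; only the resulting dates differ.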
I'm using the dataframe below in R:
ID Datetime Value
T-1 2020-01-01 15:12:14 10
T-2 2020-01-01 00:12:10 20
T-3 2020-01-01 03:11:11 25
T-4 2020-01-01 14:01:01 20
T-5 2020-01-01 18:07:11 10
T-6 2020-01-01 20:10:09 15
T-7 2020-01-01 15:45:23 15
Using the dataframe above, I want to group the datetimes on an hourly basis, for which I'm using the following code.
library(tidyverse)
DF$bins <- cut(lubridate::hour(DF$Datetime), c(-1, 0:24 - 0.01))
levels(DF$bins) <- c("00:00 to 00:59", "00:01 to 01:59", "00:02 to 02:59", "00:03 to 03:59", "00:04 to 04:59", "00:05 to 05:59",
"00:06 to 06:59", "00:07 to 07:59", "00:08 to 08:59", "00:09 to 09:59", "00:10 to 10:59", "00:11 to 11:59",
"00:12 to 12:59", "00:13 to 13:59", "00:14 to 14:59", "00:15 to 15:59", "00:16 to 16:59", "00:17 to 17:59",
"00:18 to 18:59", "00:19 to 19:59", "00:20 to 20:59", "00:21 to 21:59", "00:22 to 22:59", "00:23 to 23:59")
newDF <- DF %>%
dplyr::group_by(bins, .drop = FALSE) %>%
dplyr::summarise(Count = length(Value), Total = sum(Value))
Final<-newDF %>%
dplyr::summarise(bins = "January", Count = sum(Count), Total = sum(Total)) %>% bind_rows(newDF)
Final[,c(2,3)]<-sapply(Final[,c(2,3)], function(x) scales::comma(x))
At levels(DF$bins) <- I'm getting the error
Error in `levels<-.factor`(`*tmp*`, value = c("00:00 to 00:59", "00:01 to 01:59", : number of levels differs
How can I keep the segregation below static and aggregate the numbers accordingly?
"00:00 to 00:59", "00:01 to 01:59", "00:02 to 02:59", "00:03 to 03:59", "00:04 to 04:59", "00:05 to 05:59", "00:06 to 06:59", "00:07 to 07:59", "00:08 to 08:59", "00:09 to 09:59", "00:10 to 10:59", "00:11 to 11:59","00:12 to 12:59", "00:13 to 13:59", "00:14 to 14:59", "00:15 to 15:59", "00:16 to 16:59", "00:17 to 17:59","00:18 to 18:59", "00:19 to 19:59", "00:20 to 20:59", "00:21 to 21:59", "00:22 to 22:59", "00:23 to 23:59"
Expected Output:
Month Count Sum
Jan-20 7 115
12:00 AM to 05:00 AM 2 45
06:00 AM to 12:00 PM 0 0
12:00 PM to 03:00 PM 1 20
03:00 PM to 08:00 PM 3 35
08:00 PM to 12:00 AM 1 15
We can use floor_date/ceiling_date from lubridate to create hourly breaks, create a grouping column (bins) based on our requirement using sprintf and then use this column to calculate whatever we want for each group.
library(dplyr)
library(lubridate)
df %>%
mutate(bins = floor_date(Datetime, "hour"),
hour = hour(bins),
bins = paste0(sprintf("%02d:00 :", hour), sprintf(" %02d:59", hour))) %>%
group_by(bins) %>%
summarise(sum = sum(Value))
# A tibble: 6 x 2
# bins sum
# <chr> <int>
#1 00:00 : 00:59 20
#2 03:00 : 03:59 25
#3 14:00 : 14:59 20
#4 15:00 : 15:59 25
#5 18:00 : 18:59 10
#6 20:00 : 20:59 15
For the updated condition, we can do
df %>%
mutate(hour = hour(Datetime),
gr = case_when(hour >= 0 & hour < 6 ~ "12:00 AM to 06:00 AM",
hour >= 6 & hour < 12 ~ "06:00 AM to 12:00 PM",
hour >= 12 & hour < 15 ~ "12:00 PM to 03:00 PM",
hour >= 15 & hour < 20 ~ "03:00 PM to 08:00 PM",
TRUE ~ "08:00 PM to 12:00 AM"),
month_year = format(Datetime, "%Y-%m"),
bins = factor(gr, levels = c("12:00 AM to 06:00 AM", "06:00 AM to 12:00 PM",
"12:00 PM to 03:00 PM", "03:00 PM to 08:00 PM",
"08:00 PM to 12:00 AM"))) %>%
group_by(month_year, bins, .drop = FALSE) %>%
summarise(count = n(), sum = sum(Value))
#  month_year bins                 count   sum
#  <chr>      <fct>                <int> <int>
#1 2020-01    12:00 AM to 06:00 AM     2    45
#2 2020-01    06:00 AM to 12:00 PM     0     0
#3 2020-01    12:00 PM to 03:00 PM     1    20
#4 2020-01    03:00 PM to 08:00 PM     3    35
#5 2020-01    08:00 PM to 12:00 AM     1    15
data
df <- structure(list(ID = structure(1:7, .Label = c("T-1", "T-2", "T-3",
"T-4", "T-5", "T-6", "T-7"), class = "factor"), Datetime = structure(c(1577891534,
1577837530, 1577848271, 1577887261, 1577902031, 1577909409, 1577893523
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Value = c(10L,
20L, 25L, 20L, 10L, 15L, 15L)), row.names = c(NA, -7L), class = "data.frame")
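On the error in the question: c(-1, 0:24 - 0.01) contains 26 break points, so cut() produces 25 levels, while only 24 labels are assigned. A base R sketch that passes the labels to cut() directly, so the two cannot get out of step; the bin edges follow the answer above, and the "Total" label and variable names are assumptions made here:

```r
df <- data.frame(
  Datetime = as.POSIXct(c("2020-01-01 15:12:14", "2020-01-01 00:12:10",
                          "2020-01-01 03:11:11", "2020-01-01 14:01:01",
                          "2020-01-01 18:07:11", "2020-01-01 20:10:09",
                          "2020-01-01 15:45:23"), tz = "UTC"),
  Value = c(10L, 20L, 25L, 20L, 10L, 15L, 15L)
)

labs <- c("12:00 AM to 06:00 AM", "06:00 AM to 12:00 PM", "12:00 PM to 03:00 PM",
          "03:00 PM to 08:00 PM", "08:00 PM to 12:00 AM")

# Six break points give five left-closed intervals: [0,6), [6,12), [12,15), [15,20), [20,24).
hr   <- as.integer(format(df$Datetime, "%H"))
bins <- cut(hr, breaks = c(0, 6, 12, 15, 20, 24), right = FALSE, labels = labs)

# table() and split() keep empty factor levels, so bins without rows report 0.
out <- data.frame(
  bins  = levels(bins),
  Count = as.vector(table(bins)),
  Sum   = unname(vapply(split(df$Value, bins), sum, numeric(1))),
  stringsAsFactors = FALSE
)

# Prepend an overall row, similar to the month row in the expected output.
total  <- data.frame(bins = "Total", Count = sum(out$Count), Sum = sum(out$Sum),
                     stringsAsFactors = FALSE)
result <- rbind(total, out)
print(result)
```

Since the number of labels must equal the number of intervals, a mismatch is reported by cut() itself rather than at a later levels<- assignment.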
Here is my sample dataset
id hour
1 15:10
2 12:10
3 22:10
4 06:30
I need to find the earliest and latest times. The class of hour is factor, so I need to convert it to an appropriate class and then compare the times. I tried to format the hour using the code below, but it did not work as expected:
format(as.Date(date),"%H:%M")
Use times() from the chron package:
#Data
xx
# id hour
#1 1 15:10
#2 2 12:10
#3 3 22:10
#4 4 06:30
library(chron)
xx$hour = times(paste0(as.character(xx$hour), ":00"))
xx
# id hour
#1 1 15:10:00
#2 2 12:10:00
#3 3 22:10:00
#4 4 06:30:00
#Min and Max
range(xx$hour)
#[1] 06:30:00 22:10:00
xx = structure(list(id = 1:4, hour = structure(c(3L, 2L, 4L, 1L), .Label = c("06:30",
"12:10", "15:10", "22:10"), class = "factor")), .Names = c("id",
"hour"), row.names = c(NA, -4L), class = "data.frame")
If all you need is the earliest (min) and latest (max) times, you can just convert the times to character and use min and max, e.g.,
hour <- c("15:10", "12:10", "22:10", "06:30")
hour[which(hour == max(hour))]
#[1] "22:10"
hour[which(hour == min(hour))]
#[1] "06:30"
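Comparing the strings directly relies on every time being zero-padded to HH:MM. A small sketch that does not depend on padding, assuming the values are always hours and minutes separated by a colon (the variable names are illustrative):

```r
hour <- factor(c("15:10", "12:10", "22:10", "06:30"))

# Convert the factor to text, split on ":" and turn each time into minutes since midnight.
txt   <- as.character(hour)
parts <- strsplit(txt, ":", fixed = TRUE)
mins  <- vapply(parts, function(p) as.integer(p[1]) * 60L + as.integer(p[2]), integer(1))

earliest <- txt[which.min(mins)]
latest   <- txt[which.max(mins)]
print(c(earliest = earliest, latest = latest))
```

Calling as.character() first matters because min and max are not defined for unordered factors, which is the class of the column in the question.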