change repeating row value based on values on another column - r

Based on the data and code below, how can I change the value 1/0/1900 to values based on the corresponding row value in the year column?
Data + code:
df = structure(list(year = c("2005", "2004", "ORIG", "ORIG", "2000-2001",
"2000-2003"), date = c("1/0/1900", "1/0/1900", "1/0/1900", "1/0/1900",
"1/0/1900", "1/0/1900")), class = "data.frame", row.names = c(NA,
-6L))
desired = structure(list(year = c("2005", "2004", "ORIG", "ORIG", "2000-2001",
"2000-2001"), date = c("01/01/2005", "01/01/2004", "01/01/2005", "01/01/2005",
"01/01/2000", "01/01/2000")), class = "data.frame", row.names = c(NA,
-6L))
# Current approach replaces every `1/0/1900` with `01/01/2005`
df = df %>% mutate(date = ifelse(date == "1/0/1900",
"01/01/2005",
date))

This feels clunky, perhaps I'm missing something?
df %>%
mutate(
date = paste0(sub("/[^/]*$", "/", date),
substring(if_else(year == "ORIG", first(year), year), 1, 4))
)
#        year     date
# 1      2005 1/0/2005
# 2      2004 1/0/2004
# 3      ORIG 1/0/2005
# 4      ORIG 1/0/2005
# 5 2000-2001 1/0/2000
# 6 2000-2003 1/0/2000
However, if you are finding /1900 dates in your data, that suggests that a database somewhere had a "null" that was converted into a zero-date like value. Perhaps it should be handled more directly at the data source?
Regex explanation:
/[^/]*$
^        literal '/'
 ^^^^^   any characters that are not the literal '/'
      ^  end of string
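Essentially: match the last / together with the run of non-/ characters that ends the string (here, the year) and replace it with a bare /. We could also have used substring(date, 1, nchar(date) - 4) in place of sub(..), which assumes the year is always four characters long: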
df %>%
mutate(
date = paste0(substring(date, 1, nchar(date) - 4),
substring(if_else(year == "ORIG", first(year), year), 1, 4))
)
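To see what the two expressions above do to a single value, here is a minimal check on one string. The variable names are only for illustration.

```r
x <- "1/0/1900"

# Replace the final "/" and everything after it with a bare "/"
prefix_regex <- sub("/[^/]*$", "/", x)

# Drop the last four characters instead (assumes a 4-character year)
prefix_substr <- substring(x, 1, nchar(x) - 4)

# Append the replacement year to the prefix
new_date <- paste0(prefix_regex, "2005")
```

Both prefixes come out as "1/0/" here; they would differ only if the part after the last slash were not exactly four characters long.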

We could use case_when to replace the last 4 digits in 'date' with 2005 where 'year' is 'ORIG', and otherwise with the first 4 characters of the 'year' column.
library(dplyr)
library(stringr)
df %>%
mutate(date = case_when(year == 'ORIG' ~
str_replace(date, '\\d{4}$', '2005'),
TRUE ~ str_replace(date, '\\d{4}$', substr(year, 1, 4))))
-output
       year     date
1      2005 1/0/2005
2      2004 1/0/2004
3      ORIG 1/0/2005
4      ORIG 1/0/2005
5 2000-2001 1/0/2000
6 2000-2003 1/0/2000
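Both answers keep the 1/0/ prefix, whereas the desired data frame in the question shows dates of the form 01/01/YYYY. If that format is what is needed, a base R sketch could build the string directly. It assumes, as the answers above do, that 'ORIG' rows take the year from the first row and that only the first four characters of 'year' matter.

```r
df <- data.frame(
  year = c("2005", "2004", "ORIG", "ORIG", "2000-2001", "2000-2003"),
  date = rep("1/0/1900", 6)
)

# Assumption: "ORIG" rows inherit the year of the first row
yr <- substr(ifelse(df$year == "ORIG", df$year[1], df$year), 1, 4)

# Only touch the placeholder value; any real dates are left as they are
df$date <- ifelse(df$date == "1/0/1900", paste0("01/01/", yr), df$date)
```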

Related

Problems generating output table due to different column names

I would like to solve a problem with column names that causes errors when I run my code. Here is a simple example. My column is called TimeofCalculate, but the code below refers to Timeofcalculate, which gives an error because of the lowercase c. I would like the code to work with either spelling. I also have a database in which the column is named Timeofcalculâte; the letter â is common where I live. How can I handle all of these cases?
library(dplyr)
Test <- structure(list(date1 = as.Date(c("2021-11-01","2021-11-01","2021-11-01","2021-11-01")),
date2 = as.Date(c("2021-10-22","2021-10-22","2021-10-28","2021-10-30")),
Week = c("Friday", "Friday", "Thursday", "thursday"),
Category = c("FDE", "FDE", "FDE", "FDE"),
TimeofCalculate = c(4, 6, 6, 3)), class = "data.frame",row.names = c(NA, -4L))
Test %>%
group_by(Week = tools::toTitleCase(Week)) %>%
summarise(Time=mean(Timeofcalculate), .groups = 'drop')
Weekdays spelled inconsistently are a data-quality problem in any database, so fix that first. We may use the built-in tools::toTitleCase to capitalize the first letter.
Test <- transform(Test, Week=tools::toTitleCase(Week))
Then, we may easily aggregate by column numbers, so no names are needed.
aggregate(list(Time=Test[, 5]), list(Week=Test[, 3]), mean)
#       Week Time
# 1   Friday  5.0
# 2 Thursday  4.5
If hard-coding column indices by hand is a problem, we may use agrep, which uses approximate string matching (edit distance) to return the indices of the column names that are close to the pattern.
c_tcalc <- agrep('timeofcalculate', names(Test))
c_week <- agrep('week', names(Test))
aggregate(list(Time=Test[, c_tcalc]), list(Week=Test[, c_week]), mean)
#       Week Time
# 1   Friday  5.0
# 2 Thursday  4.5
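Because agrep returns every name within the allowed distance, not only the best one, it may be worth guarding against zero or multiple hits. The sketch below wraps the lookup in a small helper; find_col is a hypothetical name, not part of any package.

```r
nms <- c("date1", "date2", "Week", "Category", "TimeofCalculate")

# Hypothetical helper: return the single column index that approximately
# matches `pattern`, and fail loudly if the match is missing or ambiguous
find_col <- function(pattern, nms) {
  hit <- agrep(pattern, nms, ignore.case = TRUE)
  if (length(hit) != 1) {
    stop("expected exactly one match for '", pattern, "', found ", length(hit))
  }
  hit
}

c_tcalc <- find_col("timeofcalculate", nms)
c_week  <- find_col("week", nms)
```

The indices can then be passed to aggregate exactly as in the answer above.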
Data:
Test <- structure(list(date1 = structure(c(18932, 18932, 18932, 18932
), class = "Date"), date2 = structure(c(18922, 18922, 18928,
18930), class = "Date"), Week = c("Friday", "Friday", "Thursday",
"Thursday"), Category = c("FDE", "FDE", "FDE", "FDE"), TimeofCalculate = c(4,
6, 6, 3)), class = "data.frame", row.names = c(NA, -4L))
Perhaps we can take advantage of tidyselect::matches.
library(dplyr)
nms <- c('TimeofCalculate|Timeofcalculate|Timeofcalculâte')
#alternative one
Test %>%
group_by(Week = tools::toTitleCase(Week)) %>%
summarise(across(matches(nms), mean), .groups = 'drop')
#> # A tibble: 2 × 2
#>   Week     TimeofCalculate
#>   <chr>              <dbl>
#> 1 Friday               5
#> 2 Thursday             4.5
#using a purrr style lambda
Test %>%
group_by(Week = tools::toTitleCase(Week)) %>%
summarise(across(matches(nms), ~mean(., na.rm = TRUE)), .groups = 'drop')
#> # A tibble: 2 × 2
#>   Week     TimeofCalculate
#>   <chr>              <dbl>
#> 1 Friday               5
#> 2 Thursday             4.5
#this will also work
Test %>%
group_by(Week = tools::toTitleCase(Week)) %>%
summarise(across(any_of(c("Timeofcalculate", "TimeofCalculate", "Timeofcalculâte")), ~ mean(., na.rm = TRUE)), .groups = "drop")
Created on 2021-12-26 by the reprex package (v2.0.1)
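Another option, sketched here in base R, is to normalise the column name once so that all later code can rely on a single spelling. The pattern is an assumption for illustration: it ignores case and lets any single character stand in for the a/â position, which should also cover the accented spelling in a UTF-8 session.

```r
Test <- data.frame(
  Week = c("Friday", "Friday", "Thursday", "thursday"),
  Timeofcalculate = c(4, 6, 6, 3)   # deliberately the "wrong" spelling
)

# Find the column whatever its capitalisation; "." covers a or â
hit <- grep("^timeofcalcul.te$", names(Test), ignore.case = TRUE)
stopifnot(length(hit) == 1)

# Rename to one canonical spelling and fix the weekday capitalisation
names(Test)[hit] <- "TimeofCalculate"
Test$Week <- tools::toTitleCase(Test$Week)

res <- aggregate(TimeofCalculate ~ Week, data = Test, FUN = mean)
```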

The union of all intervals for each group

I have a following example dataset:
df <- data.frame("id" = c(1,2,3,3,4,4),
"start" = c("01-01-2018","01-06-2018","01-05-2018","01-05-2018","01-05-2018","01-10-2018"),
"end" = c("01-03-2018","01-07-2018","01-09-2018","01-06-2018","01-06-2018","01-11-2018"))
df$start <- as.Date(df$start, "%d-%m-%Y")
df$end <- as.Date(df$end, "%d-%m-%Y")
What I want is, for each group, the union of all its date intervals, i.e.
01-01-2018 - 01-03-2018 for group 1
01-06-2018 - 01-06-2018 for group 2
01-05-2018 - 01-09-2018 for group 3
01-05-2018 - 01-06-2018 and 01-10-2018 - 01-11-2018 for group 4
The purpose of this is to have an interval as an output, because I need it to determine whether certain observation dates for the group fall in the intervals or not.
We convert 'start' and 'end' to Date class, group by 'id', and create an interval column in summarise from the min of 'start' and the max of 'end'. Note that this gives a single spanning interval per 'id'.
library(dplyr)
library(stringr)
library(lubridate)
df %>%
mutate(across(c(start, end), mdy)) %>%
group_by(id) %>%
summarise(interval = interval(min(start), max(end)), .groups = 'drop')
data
df <- structure(list(id = c(1, 2, 3, 3, 4), start = c("01-01-2018",
"01-06-2018", "01-05-2018", "01-05-2018", "01-10-2018"), end = c("01-03-2018",
"01-07-2018", "01-09-2018", "01-06-2018", "01-11-2018")),
class = "data.frame", row.names = c(NA,
-5L))
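If disjoint intervals within a group should stay separate, as for group 4 in the question, the min/max summary is not enough. The following base R sketch merges overlapping or touching intervals per id. The helper name union_intervals is hypothetical, and the data assume six rows with id 4 appearing twice and dates in day-month-year order, as the question's expected output suggests.

```r
# Hypothetical helper: merge overlapping (or touching) date intervals
union_intervals <- function(start, end) {
  o <- order(start)
  s <- start[o]
  # Running maximum of the end dates seen so far (as day counts)
  run_end <- cummax(as.numeric(end[o]))
  # A new block starts when a start lies after every earlier end
  new_block <- c(TRUE, as.numeric(s)[-1] > run_end[-length(run_end)])
  # The last row of a block is the row just before the next block starts
  last_in_block <- c(new_block[-1], TRUE)
  data.frame(start = s[new_block],
             end = as.Date(run_end[last_in_block], origin = "1970-01-01"))
}

df <- data.frame(
  id = c(1, 2, 3, 3, 4, 4),
  start = as.Date(c("01-01-2018", "01-06-2018", "01-05-2018",
                    "01-05-2018", "01-05-2018", "01-10-2018"), "%d-%m-%Y"),
  end = as.Date(c("01-03-2018", "01-07-2018", "01-09-2018",
                  "01-06-2018", "01-06-2018", "01-11-2018"), "%d-%m-%Y")
)

# Apply the helper to each id and stack the results
res <- do.call(rbind, lapply(split(df, df$id), function(g) {
  cbind(id = g$id[1], union_intervals(g$start, g$end))
}))
```

Each row of res is one merged interval, so an observation date can be tested with a simple start <= date & date <= end comparison against the rows of its group.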

How to put/save all elements of a List into one Excel sheet in R?

I have a list (bbb) with 5 elements in it, i.e., each element for a year, like 2010, 2011, ... , 2014:
The first one in the list is this:
> bbb[1]
$`2010`
                  Date      Average
X2010.01.01 2010-01-01 2.079090e-03
X2010.01.02 2010-01-02 5.147627e-04
X2010.01.03 2010-01-03 2.997464e-04
X2010.01.04 2010-01-04 1.375538e-04
X2010.01.05 2010-01-05 1.332109e-04
The second one in the list is this:
> bbb[2]
$`2011`
                  Date      Average
X2011.01.01 2011-01-01 1.546253e-03
X2011.01.02 2011-01-02 1.152864e-03
X2011.01.03 2011-01-03 1.752446e-03
X2011.01.04 2011-01-04 2.639658e-03
X2011.01.05 2011-01-05 5.231150e-03
X2011.01.06 2011-01-06 8.909878e-04
And so on.
Here is my question:
How can I save all of these list's elements in 1 sheet of an Excel file to have something like this:
Your help would be highly appreciated.
You can do this using dcast.
bbb <- list(`2010` = data.frame(date = as.Date("2010-01-01") + 0:4,
avg = 1:5),
`2011` = data.frame(date = as.Date("2011-01-01") + 0:5,
avg = 11:16),
`2012` = data.frame(date = as.Date("2012-01-01") + 0:9,
avg = 21:30),
`2013` = data.frame(date = as.Date("2013-01-01") + 0:7,
avg = 21:28))
df <- do.call("rbind", bbb)
df$year <- format(df$date, format = "%Y")
df$month_date <- format(df$date, format = "%b-%d")
library(data.table)
library(openxlsx)
df_dcast <- dcast(as.data.table(df), month_date ~ year, value.var = "avg") # data.table's dcast expects a data.table
write.xlsx(df_dcast, "example1.xlsx")
Or using spread
library(dplyr)
library(tidyr)
df2 <- df %>%
select(-date) %>%
spread(key = year, value = avg)
write.xlsx(df2, "example2.xlsx")
This isn't very pretty, but it's the best I could think of right now: loop through the list of data frames and join them by date, like this:
library(tidyverse)
library(lubridate)
bbb <- list(`2010` = tibble(date = c('01-01-2010', '01-02-2010', '01-03-2010', '01-04-2010', '01-05-2010'),
average = 11:15),
`2011` = tibble(date = c('01-01-2011', '01-02-2011', '01-03-2011', '01-04-2011', '01-05-2011'),
average = 1:5),
`2012` = tibble(date = c('01-01-2012', '01-02-2012', '01-03-2012', '01-04-2012', '01-05-2012'),
average = 6:10))
for (i in seq_along(bbb)) {
if(i == 1){
df <- bbb[[i]] %>%
mutate(
date = paste(day(as.Date(date, format = '%m-%d-%Y')),
month(as.Date(date, format = '%m-%d-%Y'), label = TRUE),
sep = '-')
)
colnames(df) <- c('date', names(bbb[i])) # Assuming your list of dataframes has just 2 columns: date and average
} else {
join_df <- bbb[[i]] %>%
mutate(
date = paste(day(as.Date(date, format = '%m-%d-%Y')),
month(as.Date(date, format = '%m-%d-%Y'), label = TRUE),
sep = '-')
)
colnames(join_df) <- c('date', names(bbb[i]))
df <- full_join(df, join_df, by = 'date')
}
}
This loops through the list of dataframes and reformats the dates to Day-Month.
# A tibble: 5 x 4
  date  `2010` `2011` `2012`
  <chr>  <int>  <int>  <int>
1 1-Jan     11      1      6
2 2-Jan     12      2      7
3 3-Jan     13      3      8
4 4-Jan     14      4      9
5 5-Jan     15      5     10
You could then write that out with write_xlsx from the writexl package.
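The loop can also be expressed without the if/else branch. The sketch below uses base R's Map, Reduce and merge, and assumes each list element has a Date column and one value column. The names date, avg and day are illustrative, and the month-day key is formatted as numbers so that it sorts correctly and does not depend on the locale.

```r
bbb <- list(
  `2010` = data.frame(date = as.Date("2010-01-01") + 0:4, avg = 11:15),
  `2011` = data.frame(date = as.Date("2011-01-01") + 0:5, avg = 1:6)
)

# One two-column data frame per year: a month-day key and the values,
# with the value column named after the year
per_year <- Map(function(d, yr) {
  out <- data.frame(day = format(d$date, "%m-%d"), value = d$avg)
  names(out)[2] <- yr
  out
}, bbb, names(bbb))

# Full outer join of all years on the month-day key
wide <- Reduce(function(a, b) merge(a, b, by = "day", all = TRUE), per_year)

# Writing the sheet, assuming the writexl package is installed:
# writexl::write_xlsx(wide, "example3.xlsx")
```

Years with fewer days simply get NA in the rows they do not cover.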

Plotting the occurrence of an item per date

I have this kind of data:
dat
     date shop_id
1 2013-01       1
2 2013-02       2
3 2013-02       2
4 2013-02       2
5 2013-02       1
6 2013-03       3
7 2013-04       1
shop_id identifies a specific shop and date is the year and month. If a shop is listed for a given date, it was open; if not, it was closed (e.g. in January 2013 (2013-01) shop 1 was open but shops 2 and 3 were not; in March 2013 (2013-03) shop 3 was open but shops 1 and 2 were not). Since the data records sales of a specific product, a shop can occur more than once per date. I want to plot the data.
It should look like the plot below: On the y-axis should be the date, on the x-axis should be the shop_id and fill should be if the shop is open (shop_id occurs together with a specific date) or not.
dput(dat)
structure(list(date = structure(c(1L, 2L, 2L, 2L, 2L, 3L, 4L), .Label = c("2013-01",
"2013-02", "2013-03", "2013-04"), class = "factor"), shop_id = c(1,
2, 2, 2, 1, 3, 1)), class = "data.frame", row.names = c(NA, -7L
))
Is this what you are looking for?
library(tidyverse)
library(lubridate)
dat %>%
group_by(shop_id) %>%
mutate(
date = ymd(paste0(date, "-01")),
start = min(date),
end = max(date) %>% ceiling_date(unit = "month") # as Julian_Hn suggested
) %>%
ungroup() %>%
ggplot(aes(x = factor(shop_id))) +
geom_linerange(aes(
ymin = start,
ymax = end
),
size = 40 # bar width
)
Second proposition:
library(tidyverse)
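dat %>%
group_by(date) %>%
nest() %>%
ungroup() %>% # nest() keeps the grouping in current tidyr; drop it so row_number() counts across dates
arrange(date) %>%
mutate(ypos = row_number()) %>%
unnest(cols = c(data)) %>%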
ggplot() +
geom_rect(aes(
xmin = shop_id - .25,
xmax = shop_id + .25,
ymin = ypos - .5,
ymax = ypos + .5
))
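Since the question asks for a fill that shows open versus closed for every shop and date, one further option is to build the full date-by-shop grid first. This is a base R sketch; the column name open is an illustrative choice, and the plotting call is left as a comment because it assumes ggplot2 is installed.

```r
dat <- data.frame(
  date = c("2013-01", "2013-02", "2013-02", "2013-02",
           "2013-02", "2013-03", "2013-04"),
  shop_id = c(1, 2, 2, 2, 1, 3, 1)
)

# Count occurrences for every date/shop combination, including absent ones
grid <- as.data.frame(table(date = dat$date, shop_id = dat$shop_id))

# A shop counts as open on a date if it occurs at least once
grid$open <- grid$Freq > 0

# Possible plot, assuming ggplot2 is available:
# ggplot2::ggplot(grid, ggplot2::aes(shop_id, date, fill = open)) +
#   ggplot2::geom_tile()
```

Repeated rows for the same shop and date collapse into a single open cell, so duplicates in the sales data do not affect the picture.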

dplyr: Replace NAs and 0s with conditional subgroup means

I'm trying to replace all NAs and 0s in a large dataset with their respective group mean -- computed on the basis of cases that are not NA or 0.
Source: local data frame [174,019 x 3]
Groups: name
  student name hours
1      s1  ABC   1.0
2      s1  DEF    NA
3      s2  DEF   0.5
4      s3   NA   2.0
5      s3  ABC   2.0
6      s4  GHI     0
This solution using dplyr works as intended, but can this be done in one chain?
avg <- workshops %>%
filter(hours > 0 & !is.na(name)) %>%
group_by(name) %>%
summarize(avg.hours = mean(hours, na.rm = TRUE))
workshops <- workshops %>%
left_join(avg, by = "name") %>%
mutate(hours = if_else(hours > 0, hours, avg.hours, avg.hours)) %>%
select(-avg.hours)
Updated solution
workshops <- workshops %>%
group_by(name) %>%
mutate(hours = ifelse(!is.na(name), replace(hours, hours == 0 | is.na(hours),
mean(`is.na<-`(hours, hours == 0), na.rm = TRUE)), NA))
You can do:
workshops %>%
group_by(name) %>%
mutate(hours = replace(hours, hours == 0 | is.na(hours),
mean(`is.na<-`(hours, hours == 0), na.rm = TRUE)))
Here is an option with na.aggregate from zoo. After grouping by 'name', change the 0s to NA with na_if and apply na.aggregate to replace the missing values with the mean (by default, the FUN parameter is mean).
library(dplyr)
library(zoo)
workshops %>%
group_by(name) %>%
mutate(hours = na.aggregate(na_if(hours, 0)))
data
workshops <- structure(list(student = c("s1", "s1", "s2", "s3", "s3",
"s4"), name = c("ABC", "DEF", "DEF", NA, "ABC", "GHI"),
hours = c(1, NA, 0.5, 2, 2, 0)), .Names = c("student", "name", "hours"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))
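For comparison, the same idea can be sketched in base R without any packages: turn zeros into NA, compute the mean of the valid values per name, and fill the gaps from that lookup. The intermediate names h, grp_mean and filled are illustrative.

```r
workshops <- data.frame(
  student = c("s1", "s1", "s2", "s3", "s3", "s4"),
  name = c("ABC", "DEF", "DEF", NA, "ABC", "GHI"),
  hours = c(1, NA, 0.5, 2, 2, 0)
)

# Treat zeros as missing
h <- replace(workshops$hours, workshops$hours %in% 0, NA)

# Mean of the valid hours per name (rows with NA name are ignored)
grp_mean <- tapply(h, workshops$name, mean, na.rm = TRUE)

# Fill missing hours from the group mean, keep valid hours as they are
filled <- ifelse(is.na(h), grp_mean[workshops$name], h)
```

A group with no valid hours at all, such as GHI here, has nothing to average, so its mean is NaN and the value stays missing; whether that is acceptable depends on the data.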
