I have been trying to wrap my head around this.
I need to create a corrected column based on detecting a specific comment in a separate "error" column of my dataset. I can do this with grepl, but I am struggling to get str_detect to work as well (it is usually faster on big datasets).
Here is an example dataset:
test <- tibble(
id = seq(1:30),
date = sample(seq(as.Date('2000/01/01'), as.Date('2018/01/01'), by="day"), 30),
error = c(rep(NA, 3), "wrong date! Correct date = 01.03.2022",
rep(NA, 5), "wrong date! Correct date = 01.05.2021",
rep(NA, 5), "wrong date! Correct date = 01.03.2022",
rep(NA, 7), "wrong date! Correct date = 01.05.2021",
rep(NA, 2), "date already corrected on 01.05.2021",
NA, "date already corrected on 01.03.2022", NA))
I first tried to create a new "date_corr" column with str_detect:
test %>%
mutate(date_corr=if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
This yields:
# A tibble: 30 × 4
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA NA
2 2 2004-06-30 NA NA
3 3 2015-09-25 NA NA
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA NA
6 6 2004-08-02 NA NA
7 7 2001-10-15 NA NA
8 8 2007-07-21 NA NA
9 9 2014-04-19 NA NA
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
Adding rowwise() makes no difference:
test %>%
rowwise() %>%
mutate(date_corr=if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
# A tibble: 30 × 4
# Rowwise:
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA NA
2 2 2004-06-30 NA NA
3 3 2015-09-25 NA NA
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA NA
6 6 2004-08-02 NA NA
7 7 2001-10-15 NA NA
8 8 2007-07-21 NA NA
9 9 2014-04-19 NA NA
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
However, with grepl I get the desired outcome, regardless of rowwise:
test %>%
mutate(date_corr=if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
date_corr=if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))
# A tibble: 30 × 4
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA 2010-04-28
2 2 2004-06-30 NA 2004-06-30
3 3 2015-09-25 NA 2015-09-25
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA 2008-07-16
6 6 2004-08-02 NA 2004-08-02
7 7 2001-10-15 NA 2001-10-15
8 8 2007-07-21 NA 2007-07-21
9 9 2014-04-19 NA 2014-04-19
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
test %>%
rowwise() %>%
mutate(date_corr=if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
date_corr=if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))
# A tibble: 30 × 4
# Rowwise:
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA 2010-04-28
2 2 2004-06-30 NA 2004-06-30
3 3 2015-09-25 NA 2015-09-25
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA 2008-07-16
6 6 2004-08-02 NA 2004-08-02
7 7 2001-10-15 NA 2001-10-15
8 8 2007-07-21 NA 2007-07-21
9 9 2014-04-19 NA 2014-04-19
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
What am I missing here?
The difference is in how they handle NA values:
str_detect(NA, "missing")
# [1] NA
grepl("missing", NA)
# [1] FALSE
Note also that if the condition passed to if_else is NA, the result is NA as well:
if_else(NA, 1, 2)
# [1] NA
So str_detect preserved the NA value, and if_else turned it into an NA date. It's not clear what the "right" value should be, but if you want str_detect to give the same result as grepl here, you can be explicit about not changing rows where error is NA:
test %>%
mutate(date_corr=if_else(!is.na(error) & str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
date_corr=if_else(!is.na(error) & str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
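As a variation on the same idea (a minimal sketch, not part of the answer above), the NA returned by str_detect can be turned into FALSE with dplyr::coalesce(), or handled through the missing argument of if_else. The vectors error and date below are made-up toy data, and pattern is just a helper name used for illustration.

```r
library(dplyr)
library(stringr)

# Toy data (assumed for illustration): one NA, one match, one non-match
error <- c(NA,
           "wrong date! Correct date = 01.03.2022",
           "date already corrected on 01.03.2022")
date <- as.Date(c("2010-04-28", "2005-08-21", "2004-06-30"))
pattern <- "date = 01\\.03\\.2022$"

# Option 1: replace the NA produced by str_detect() with FALSE
hit <- coalesce(str_detect(error, pattern), FALSE)
date_corr <- if_else(hit, as.Date("2022-03-01"), date)

# Option 2: tell if_else() what to return when the condition is NA
date_corr2 <- if_else(str_detect(error, pattern),
                      as.Date("2022-03-01"), date,
                      missing = date)
```

Both options keep the original date wherever error is NA, which is the behaviour grepl gives by default.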
I want to insert the missing weeks for each Household_id and channel combination so that the weeks form a continuous sequence. The inserted rows should have duration 0, and the other columns should keep their values.
Below is the dataset.
For example, for Household_id 100 and channel A the missing weeks are 37, 39 and 41. I want these weeks to be inserted with duration 0.
But for Household_id 101 and channel C two years are involved, 2019 and 2020. The missing weeks are week 52 of 2019 and week 3 of 2020.
This is what I tried, using the complete function:
library(tidyr)
library(dplyr)
temp <- data %>% group_by(Household_id,channel) %>%
tidyr::complete(week = seq(min(week),max(week)),fill = list(duration=0))
For the Household_id 100 and channel A combination it worked fine. All weeks are now in sequence.
But for Household_id 101 and channel C it did not work. After inserting week 52 of 2019, the sequence should continue with week 1 of 2020.
I also tried deriving dates from the week and year columns, thinking that exact dates might work,
but I could not get that to work either.
data$date <- as.Date(paste(data$year,data$week,1,sep=""),"%Y%U%u")
Any help is greatly appreciated!
Here is the sample dataset with code:
library(dplyr)
library(tidyr)
data <- data.frame(Household_id = c(100,100,100,100,101,101,101,101,102,102),
channel = c("A","A","A","A","C","C","C","C","D","D"),
duration = c(12,34,567,67,98,23,56,89,73,76),
mala_fide_week = c(42,42,42,42,5,5,5,5,30,30),
mala_fide_year =c(2021,2021,2021,2021,2020,2020,2020,2020,2021,2021),
week =c(36,38,40,42,51,1,2,4,38,39),
year = c(2021,2021,2021,2021,2019,2020,2020,2020,2021,2021))
# imputing missing weeks with duration = 0 for each household/channel combination
temp <- data %>% group_by(Household_id,channel) %>%
tidyr::complete(week = seq(min(week),max(week)),fill = list(duration=0))
# Getting Date from week/year if it may help
data$date <- as.Date(paste(data$year,data$week,1,sep=""),"%Y%U%u")
You can try building dates from year and week, generating the weekly sequence from those dates, and converting back to weeks. I used lubridate for convenience.
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
group_by(Household_id,channel) %>%
mutate(new = paste0(year, '-01-01'),
new = ymd(new) + 7 * week) %>%
complete(new = seq(min(new),max(new), by = 'week'), fill = list(duration=0)) %>%
mutate(year = replace(year, is.na(year), format(new, '%Y')[is.na(year)]),
week = week(new)) %>%
select(-new)
Household_id channel duration mala_fide_week mala_fide_year week year
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 100 A 12 42 2021 37 2021
2 100 A 0 NA NA 38 2021
3 100 A 34 42 2021 39 2021
4 100 A 0 NA NA 40 2021
5 100 A 567 42 2021 41 2021
6 100 A 0 NA NA 42 2021
7 100 A 67 42 2021 43 2021
8 101 C 98 5 2020 52 2019
9 101 C 0 NA NA 53 2019
10 101 C 0 NA NA 1 2020
11 101 C 0 NA NA 2 2020
12 101 C 0 NA NA 3 2020
13 101 C 0 NA NA 4 2020
14 102 D 73 30 2021 39 2021
15 102 D 76 30 2021 40 2021
16 101 C 23 5 2020 2 2020
17 101 C 56 5 2020 3 2020
18 101 C 89 5 2020 5 2020
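Another way to cross the year boundary, sketched here under a loud assumption, is to work with a running week index instead of dates. The sketch assumes every year has exactly 52 weeks (so a week 53 would not be handled), and idx is a hypothetical helper column that is not in the original data. Only the Household_id 101 rows from the question are used.

```r
library(dplyr)
library(tidyr)

# Subset of the question's data: the group that spans two years
data <- data.frame(Household_id = c(101, 101, 101, 101),
                   channel = c("C", "C", "C", "C"),
                   duration = c(98, 23, 56, 89),
                   week = c(51, 1, 2, 4),
                   year = c(2019, 2020, 2020, 2020))

filled <- data %>%
  # ASSUMPTION: 52 weeks per year; idx counts weeks continuously across years
  mutate(idx = year * 52 + (week - 1)) %>%
  group_by(Household_id, channel) %>%
  complete(idx = seq(min(idx), max(idx)), fill = list(duration = 0)) %>%
  # recover year and week from the running index for the inserted rows
  mutate(year = idx %/% 52, week = idx %% 52 + 1) %>%
  ungroup() %>%
  select(-idx)

filled
```

Here the weeks come out as 51 and 52 of 2019 followed by 1 to 4 of 2020, with duration 0 on the two inserted rows. Columns that are not recomputed would stay NA on inserted rows and would need tidyr::fill() or similar.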
I've got this dataset and want to perform some calculations based on certain conditions:
library(tidyverse)
library(lubridate)
filas <- structure(list(Año = c(rep(2020,4),rep(2021,4),2022),
Mes = c(2:5,3:4,9,11,1),
Id = c(rep(1,7),2,2)),
row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame")) %>%
mutate(fecha = make_date(Año,Mes,1),
meses_imp = make_date(2999,1,1))
Año   Mes  Id  fecha       meses_imp
2020    2   1  2020-02-01  2999-01-01
2020    3   1  2020-03-01  2999-01-01
2020    4   1  2020-04-01  2999-01-01
2020    5   1  2020-05-01  2999-01-01
2021    3   1  2021-03-01  2999-01-01
2021    4   1  2021-04-01  2999-01-01
2021    9   1  2021-09-01  2999-01-01
2021   11   2  2021-11-01  2999-01-01
2022    1   2  2022-01-01  2999-01-01
I need to add rows for each "Id" wherever there are gaps ("holes") between two consecutive months, and then count those added rows. I've achieved this using a "while" loop:
i <- 2
while(!is.na(filas[i,]$Id)) {
if (as.double(difftime(filas[i,]$fecha,filas[i-1,]$fecha)) > 31 &
filas[i,]$Id == filas[i-1,]$Id) {
filas <- add_row(filas,
Id = filas[i,]$Id,
fecha = filas[i-1,]$fecha + months(1),
meses_imp = pmin(filas[i-1,]$fecha,
filas[i-1,]$meses_imp),
.after = i-1)}
i=i+1}
filas2 <- filas %>%
group_by(Id,meses_imp) %>%
summarise(cant_meses_imp = n()) %>%
ungroup() %>%
filter(meses_imp != "2999-01-01")
filas <- left_join(filas,
filas2,
by=c("Id","meses_imp"))
Año   Mes  Id  fecha       meses_imp   cant_meses_imp
2020    2   1  2020-02-01  2999-01-01  NA
2020    3   1  2020-03-01  2999-01-01  NA
2020    4   1  2020-04-01  2999-01-01  NA
2020    5   1  2020-05-01  2999-01-01  NA
NA     NA   1  2020-06-01  2020-05-01  9
NA     NA   1  2020-07-01  2020-05-01  9
NA     NA   1  2020-08-01  2020-05-01  9
NA     NA   1  2020-09-01  2020-05-01  9
NA     NA   1  2020-10-01  2020-05-01  9
NA     NA   1  2020-11-01  2020-05-01  9
NA     NA   1  2020-12-01  2020-05-01  9
NA     NA   1  2021-01-01  2020-05-01  9
NA     NA   1  2021-02-01  2020-05-01  9
2021    3   1  2021-03-01  2999-01-01  NA
2021    4   1  2021-04-01  2999-01-01  NA
NA     NA   1  2021-05-01  2021-04-01  4
NA     NA   1  2021-06-01  2021-04-01  4
NA     NA   1  2021-07-01  2021-04-01  4
NA     NA   1  2021-08-01  2021-04-01  4
2021    9   1  2021-09-01  2999-01-01  NA
2021   11   2  2021-11-01  2999-01-01  NA
NA     NA   2  2021-12-01  2021-11-01  1
2022    1   2  2022-01-01  2999-01-01  NA
Since I'd like to apply this to a much larger dataset (~300k rows), how could I rewrite it in a vectorised way so that it's more efficient (and maybe more elegant)?
Thanks!
You can use the following code, which relies on the padr and zoo packages.
The idea is to:
1. Add missing dates with the padr::pad() function.
2. Remove unwanted rows (non-integer Id values).
3. Create na and grp columns to identify the rows added in step 1.
4. Group by grp and create a column cant_meses_imp that counts the number of consecutive NA rows in each group.
5. Select only the desired columns.
library(dplyr)
library(padr)
library(zoo)
filas %>%
pad(by = "fecha") %>% # add missing dates
mutate(Id = na.approx(Id)) %>% # interpolate NA values in Id column
subset(Id%%1 == 0) %>% # Keep only integer Id values
# This part is for generating the cant_meses_imp column
mutate(na = ifelse(is.na(Mes), 1, 0),
grp = rle(na)$lengths %>% {rep(seq(length(.)), .)}) %>%
group_by(grp) %>%
mutate(cant_meses_imp = ifelse(na == 0, NA, n())) %>%
ungroup() %>%
select(-c(na, grp))
The code does not reproduce the fecha column exactly, as there are no guidelines for its values.
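For comparison, here is a minimal sketch of a different vectorised route that uses only dplyr and tidyr. It is not the padr approach above: tidyr::complete() fills the monthly gaps per Id, and a running counter marks which original row each added row follows. The toy data is a shortened stand-in for filas, and original, added and run are hypothetical helper columns introduced for illustration.

```r
library(dplyr)
library(tidyr)

# Shortened toy version of the data (assumed for illustration)
filas <- tibble(
  Id    = c(1, 1, 1, 2, 2),
  fecha = as.Date(c("2020-04-01", "2020-05-01", "2020-08-01",
                    "2021-11-01", "2022-01-01"))
)

res <- filas %>%
  mutate(original = TRUE) %>%                 # flag rows that exist already
  group_by(Id) %>%
  complete(fecha = seq(min(fecha), max(fecha), by = "month")) %>%
  mutate(added = is.na(original),             # TRUE for rows complete() inserted
         run   = cumsum(!added)) %>%          # each original row starts a new run
  group_by(Id, run) %>%
  mutate(cant_meses_imp = if_else(added, sum(added), NA_integer_)) %>%
  ungroup() %>%
  select(-original, -run)

res
```

Each inserted row carries the number of months added in its gap, and original rows keep NA, which mirrors the cant_meses_imp column in the question.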
Here is my toy dataset:
df <- tibble::tribble(
~date, ~value,
"2007-01-31", 25,
"2007-05-31", 31,
"2007-12-31", 26
)
I am creating month-end date series using the following code.
df %>%
mutate(date = as.Date(date)) %>%
complete(date = seq(as.Date("2007-01-31"), as.Date("2019-12-31"), by="month"))
However, I am not getting the correct month-end dates.
date value
<date> <dbl>
1 2007-01-31 25
2 2007-03-03 NA
3 2007-03-31 NA
4 2007-05-01 NA
5 2007-05-31 31
6 2007-07-01 NA
7 2007-07-31 NA
8 2007-08-31 NA
9 2007-10-01 NA
10 2007-10-31 NA
11 2007-12-01 NA
12 2007-12-31 26
What am I missing here? I am okay using other functions from any other package.
There is no need for the complete function to build the dates; you can do this in base R.
Since the last day of the month differs from month to month, we can create a sequence of month start dates and subtract 1 day from each.
seq(as.Date("2007-02-01"), as.Date("2008-01-01"), by="month") - 1
#[1] "2007-01-31" "2007-02-28" "2007-03-31" "2007-04-30" "2007-05-31" "2007-06-30"
# "2007-07-31" "2007-08-31" "2007-09-30" "2007-10-31" "2007-11-30" "2007-12-31"
Using the same logic on the updated data frame, we can do:
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
tidyr::complete(date = seq(min(date) + 1, max(date) + 1, by="month") - 1)
# date value
# <date> <dbl>
# 1 2007-01-31 25
# 2 2007-02-28 NA
# 3 2007-03-31 NA
# 4 2007-04-30 NA
# 5 2007-05-31 31
# 6 2007-06-30 NA
# 7 2007-07-31 NA
# 8 2007-08-31 NA
# 9 2007-09-30 NA
#10 2007-10-31 NA
#11 2007-11-30 NA
#12 2007-12-31 26
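If lubridate is an option, a similar month-end sequence can be sketched with its %m+% operator, which adds months and rolls back to the last valid day of the month instead of overflowing into the next one. This is an alternative to the base R trick above, and start and month_ends are names chosen for illustration.

```r
library(lubridate)

start <- as.Date("2007-01-31")

# Adding 0..11 months with %m+% clips to the last day of shorter months,
# so starting from the 31st yields the month-end of every month
month_ends <- start %m+% months(0:11)

month_ends
```

Because the start date is the 31st, every element is a month-end date; this would not hold for a start date such as the 28th.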
I am trying to use the na.locf function in a mutate and I am getting a strange answer. The data is ordered desc by date and then if a column is NA gets the result from na.locf and otherwise uses the value in the column. For most of the data, the answer is being returned as expected, but one row is coming back not as the previous non-NA but as the next non-NA. If we order the data by date ascending and use na.rm = F and fromLast = T it works as expected, but I want to understand why the result is not working if date is ordered descending.
The example is as follows:
example = data.frame(Date = factor(c("1/14/15", "1/29/15", "2/3/15",
"2/11/15", "2/15/15", "3/4/15","3/7/15", "3/7/15", "3/11/15",
"3/18/15", "3/21/15", "4/22/15", "4/22/15", "4/23/15", "5/6/15",
"5/13/15", "5/18/15", "5/24/15", "5/26/15", "5/28/15", "5/29/15",
"5/29/15", "6/25/15", "6/25/15","8/6/15", "8/15/15", "8/20/15",
"8/22/15", "8/22/15", "8/29/15")),
Scan = c(1, rep(NA, 21),2,rep(NA,7)),
Hours = c(rep(NA,3), rep(3,3), NA, 2, rep(3,3), NA, 2, 3, 2,
rep(3,5), NA, 2, rep(c(NA, 3),2), 3, NA, 2, 3)
)
example %>%
mutate(
date = as.Date(Date, "%m/%d/%y"),
Hours = replace_na(Hours,0),
scan_date = as.Date(ifelse(is.na(Scan),
NA,
date),
origin="1970-01-01")) %>%
arrange(desc(date)) %>%
mutate(
scan_new = ifelse(is.na(Scan),
na.locf(Scan),
Scan))
The issue in the result is in row 24, the Scan is coming in as 1 rather than 2:
Date Scan Hours date scan_date scan_new
23 3/7/15 NA 0 2015-03-07 <NA> 2
24 3/7/15 NA 2 2015-03-07 <NA> 1
25 3/4/15 NA 3 2015-03-04 <NA> 2
Interestingly, other rows that share a date are handled correctly, for example rows 18-19:
Date Scan Hours date scan_date scan_new
18 4/22/15 NA 0 2015-04-22 <NA> 2
19 4/22/15 NA 2 2015-04-22 <NA> 2
For reference as noted above, the following provides the expected answer:
example %>%
mutate(
date = as.Date(Date, "%m/%d/%y"),
Hours = replace_na(Hours,0),
scan_date = as.Date(ifelse(is.na(Scan),
NA,
date),
origin="1970-01-01")) %>%
arrange(desc(date)) %>%
mutate(
scan_new = ifelse(is.na(Scan),
na.locf(Scan, na.rm = F, fromLast = T),
Scan))
Date Scan Hours date scan_date scan_new
6 3/4/15 NA 3 2015-03-04 <NA> 2
7 3/7/15 NA 0 2015-03-07 <NA> 2
8 3/7/15 NA 2 2015-03-07 <NA> 2
Can someone tell me why this is behaving this way?
In your first attempt, na.locf(Scan), the leading NAs are removed, and the remaining (shorter) vector is then recycled to the full length by ifelse. You can see the result with na.rm = FALSE (or na.locf0, see comments) for reference:
example %>%
mutate(
date = as.Date(Date, "%m/%d/%y"),
Hours = replace_na(Hours,0),
scan_date = as.Date(ifelse(is.na(Scan),
NA,
date),
origin="1970-01-01")) %>%
arrange(desc(date)) %>%
mutate(
scan_new = ifelse(is.na(Scan),
na.locf(Scan, na.rm = FALSE),
Scan))
# Date Scan Hours date scan_date scan_new
# 1 8/29/15 NA 3 2015-08-29 <NA> NA
# 2 8/22/15 NA 0 2015-08-22 <NA> NA
# 3 8/22/15 NA 2 2015-08-22 <NA> NA
# 4 8/20/15 NA 3 2015-08-20 <NA> NA
# 5 8/15/15 NA 3 2015-08-15 <NA> NA
# 6 8/6/15 NA 0 2015-08-06 <NA> NA
# 7 6/25/15 2 0 2015-06-25 2015-06-25 2
# 8 6/25/15 NA 3 2015-06-25 <NA> 2
# 9 5/29/15 NA 0 2015-05-29 <NA> 2
# 10 5/29/15 NA 2 2015-05-29 <NA> 2
# 11 5/28/15 NA 3 2015-05-28 <NA> 2
# 12 5/26/15 NA 3 2015-05-26 <NA> 2
# 13 5/24/15 NA 3 2015-05-24 <NA> 2
# 14 5/18/15 NA 3 2015-05-18 <NA> 2
# 15 5/13/15 NA 3 2015-05-13 <NA> 2
# 16 5/6/15 NA 2 2015-05-06 <NA> 2
# 17 4/23/15 NA 3 2015-04-23 <NA> 2
# 18 4/22/15 NA 0 2015-04-22 <NA> 2
# 19 4/22/15 NA 2 2015-04-22 <NA> 2
# 20 3/21/15 NA 3 2015-03-21 <NA> 2
# 21 3/18/15 NA 3 2015-03-18 <NA> 2
# 22 3/11/15 NA 3 2015-03-11 <NA> 2
# 23 3/7/15 NA 0 2015-03-07 <NA> 2
# 24 3/7/15 NA 2 2015-03-07 <NA> 2
# 25 3/4/15 NA 3 2015-03-04 <NA> 2
# 26 2/15/15 NA 3 2015-02-15 <NA> 2
# 27 2/11/15 NA 3 2015-02-11 <NA> 2
# 28 2/3/15 NA 0 2015-02-03 <NA> 2
# 29 1/29/15 NA 0 2015-01-29 <NA> 2
# 30 1/14/15 1 0 2015-01-14 2015-01-14 1
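The length mismatch described above can be seen on a small vector. This is a minimal sketch with made-up values; x, short, full, misaligned and aligned are illustrative names only.

```r
library(zoo)

x <- c(NA, 2, NA, NA, 1, NA)        # toy vector with one leading NA

short <- na.locf(x)                 # default na.rm = TRUE drops the leading NA
full  <- na.locf(x, na.rm = FALSE)  # keeps it, so the length matches x
same  <- na.locf0(x)                # na.locf0 also preserves the length

length(short)                       # 5, one shorter than x

# ifelse() silently recycles `short` to length 6, so values shift position
misaligned <- ifelse(is.na(x), short, x)   # 2 2 2 1 1 2
aligned    <- ifelse(is.na(x), full, x)    # NA 2 2 2 1 1
```

In misaligned, positions 4 and 6 pick up values that belong to a neighbouring position, which is the same effect as the unexpected 1 in row 24 of the question.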