In my dataframe, I have a column "dates" and I would like for R to walk through each row of dates in a loop to see if the date before or after it is within a 3-14 day range, and if not, it's indexed to a list to be removed at the end of the loop.
For example:
my_dates <- c( "1/4/2019", "1/18/2019", "4/3/2019", "2/20/2019", "4/5/2019")
I would want to remove the entire row containing 2/20/2019 because there is no other date that is within 3-14 days of that date.
Any help would be greatly appreciated!
Use a bit of ordering and diffing:
my_dates <- c( "1/4/2019", "1/18/2019", "4/3/2019", "2/20/2019", "4/5/2019")
my_dates <- as.Date(my_dates, format="%m/%d/%Y")
o <- order(my_dates)
d <- abs(diff(my_dates[o]))
my_dates[o[ c(Inf,d) <= 14 | c(d,Inf) <= 14 ]]
#[1] "2019-01-04" "2019-01-18" "2019-04-03" "2019-04-05"
Here's a verbose way using lubridate and dplyr.
my_dates <- c( "1/4/2019", "1/18/2019", "4/3/2019", "2/20/2019", "4/5/2019")
library(lubridate); library(dplyr)
df <- data.frame(dates = mdy(my_dates)) %>%
  arrange(dates) %>%
  mutate(days_prior = dates - lag(dates),       # gap to the previous date
         days_after = lead(dates) - dates) %>%  # gap to the next date
  mutate(closest_day = pmin(days_prior, days_after, na.rm = TRUE)) %>%
  filter(closest_day <= 14)
Here is one way from outer, data from thelatemail
s=abs(-outer(my_dates, my_dates, '-'))
my_dates[rowSums(s<=14)>1]
[1] "2019-01-04" "2019-01-18" "2019-04-03" "2019-04-05"
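The answers above only apply the 14-day upper bound. As a rough base R sketch of the same outer idea applied to the rows of a data frame, with an optional lower bound: the names near_rows, lo, hi and the id column are made up for illustration and are not from the original posts.

```r
my_dates <- as.Date(c("1/4/2019", "1/18/2019", "4/3/2019", "2/20/2019", "4/5/2019"),
                    format = "%m/%d/%Y")
df <- data.frame(dates = my_dates, id = seq_along(my_dates))

# near_rows() is a hypothetical helper: TRUE where some OTHER date lies
# between lo and hi days away (inclusive).
near_rows <- function(dates, lo = 0, hi = 14) {
  gap <- abs(outer(as.numeric(dates), as.numeric(dates), "-"))
  diag(gap) <- NA  # a date is not its own neighbour
  rowSums(gap >= lo & gap <= hi, na.rm = TRUE) > 0
}

keep <- near_rows(df$dates)  # only the upper bound, as in the answers above
df_kept <- df[keep, ]        # drops the 2019-02-20 row

# With lo = 3 the 2-day gap between 4/3 and 4/5 no longer counts.
keep_strict <- near_rows(df$dates, lo = 3)
```

Note that with lo = 3 the two April dates are dropped as well, so whether the lower bound should be enforced depends on what "within 3-14 days" is meant to cover.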
I have a large database with one of the columns containing dates with the following format: DD-MM-YYYY.
I would like to invert the date format, to something like this: YYYY-MM-DD.
Can someone tell me how can I do it using bash OR R?
A possible solution:
library(tidyverse)
library(lubridate)
df <- data.frame(date=c("11-4-2021","5-6-2019"))
df %>%
mutate(date2 = dmy(date) %>% ymd)
#> date date2
#> 1 11-4-2021 2021-04-11
#> 2 5-6-2019 2019-06-05
In bash, we can use string manipulation:
dmy=30-12-2021
echo "${dmy:6:4}-${dmy:3:2}-${dmy:0:2}" # 2021-12-30
or with read:
IFS="-" read -r d m y <<<"$dmy"
echo "$y-$m-$d"
I used R to solve my problem like this:
df is a data.frame with dates in the column "eventDate". Dates were in the format DD-MM-YYYY. There were several cells with incomplete dates (e.g. MM-YYYY or YYYY).
library(tidyr)
x <- separate(df, col = eventDate, into = c("day", "month", "year"), sep="-")
# list the columns explicitly so they are pasted in year-month-day order
y <- x %>% unite("eventDate_2", c(year, month, day), remove=TRUE, sep="-", na.rm= TRUE)
y <- cbind(y, df$eventDate) # add the original column to check the result and correct individual errors
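For complete dates, the same conversion can probably be done in base R without extra packages. This is a minimal sketch on made-up values (x and iso are illustrative names only):

```r
# Example values in DD-MM-YYYY; the last two are incomplete on purpose.
x <- c("11-4-2021", "5-6-2019", "30-12-2021", "4-2021", "2021")

# Parse as day-month-year, then print as year-month-day.
iso <- format(as.Date(x, format = "%d-%m-%Y"), "%Y-%m-%d")
```

Strings that do not match the full DD-MM-YYYY pattern come back as NA here, so the separate/unite approach above is the better fit when incomplete dates have to be preserved.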
Data example.
date1 = seq(as.Date("2019/01/01"), by = "month", length.out = 29)
date2= seq(as.Date("2019/05/01"), by = "month", length.out = 29)
subproducts1=rep("1",29)
subproducts2=rep("2",29)
b1 <- c(rnorm(29,5))
b2 <- c(rnorm(29,5))
dfone <- data.frame("date"= c(date1,date2),
"subproduct"=
c(subproducts1,subproducts2),
"actuals"= c(b1,b2))
Max Date for Subproduct 1 is May 2021 and max date for Subproduct 2 is Sept 2021.
Question: Is there a way to:
1. Find the max date for each unique subproduct, and
2. Find the minimum of those max dates, all in one step?
The final result should be May 2021 in this case and able to handle multiple subproducts.
We may use slice_max after grouping by 'subproduct', pull the date and get the min, assign it to a new object
library(dplyr)
dfone %>%
group_by(subproduct) %>%
slice_max(n = 1, order_by = date) %>%
ungroup %>%
pull(date) %>%
min -> Min_date
-output
Min_date
[1] "2021-05-01"
Another option is to arrange the rows and filter using duplicated
dfone %>%
arrange(subproduct, desc(date)) %>%
filter(!duplicated(subproduct)) %>%
pull(date) %>%
min
For your first goal, you can try subset + ave like below
out1 <- subset(
dfone,
ave(date, subproduct, FUN = max) == date
)
which gives
date subproduct actuals
29 2021-05-01 1 5.728420
58 2021-09-01 2 3.455491
For your second goal, based on out1, you can try
out2 <- subset(
out1,
date == min(date)
)
which gives
date subproduct actuals
29 2021-05-01 1 5.083229
This could also be done in base R. In the end I used Reduce so that the solution can be generalized to any number of subproducts and dates and not just 2 values as is the case here.
Reduce(function(x, y) min(x, y),
lapply(unique(dfone$subproduct), \(x){
max(dfone$date[dfone$subproduct == x])
}))
[1] "2021-05-01"
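To check that the "min of the per-group max" idea carries over to more than two subproducts, here is a small sketch on a made-up data frame (toy, per_group_max and min_of_max are illustrative names, not from the question):

```r
# Three subproducts with different last dates.
toy <- data.frame(
  date = as.Date(c("2021-03-01", "2021-05-01", "2021-09-01",
                   "2021-07-01", "2021-06-01")),
  subproduct = c("1", "1", "2", "3", "3")
)

# Latest date per subproduct, kept as day counts so the class
# is not silently lost when the results are combined.
per_group_max <- vapply(split(toy$date, toy$subproduct),
                        function(d) as.numeric(max(d)),
                        numeric(1))

# Earliest of those latest dates, converted back to a Date.
min_of_max <- as.Date(min(per_group_max), origin = "1970-01-01")
```

Here the per-group maxima are 2021-05-01, 2021-09-01 and 2021-07-01, so min_of_max is 2021-05-01.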
For the sake of completeness, here are also data.table and sqldf solutions:
1. data.table
library(data.table)
setDT(dfone)[, max(date), by = subproduct][, min(V1)]
[1] "2021-05-01"
2. sqldf
sqldf::sqldf("
select min(date) from (
select max(date) as date from dfone group by subproduct
)", method = "Date")
min(date)
1 2021-05-01
Another attempt - sort the dfone data by date descending, find the first instance of each subproduct, and take the minimum:
with(dfone[order(dfone$date, decreasing=TRUE),],
min(date[match(unique(subproduct), subproduct)]))
#[1] "2021-05-01"
Although the question has been marked as solved, here is one more hack, using an anonymous {} block:
library(dplyr)
dfone %>% group_by(subproduct) %>%
summarise(d = max(date), .groups = 'drop') %>%
{min(.$d)}
#> [1] "2021-05-01"
Created on 2021-07-16 by the reprex package (v2.0.0)
I'm trying to use dplyr in R to difference a variable between two dates.
A simplified example:
# Simple script to test calculating the difference of a column between two dates
library(dplyr)
library(lubridate)
library(tibble)
dataA <- as_tibble(ymd('2020-01-01') + days(seq(0:45)))  # as.tibble() is deprecated
colnames(dataA) = c('date')
dataA <- dataA %>% mutate(xvar = seq(0:45))
#add the difference in xvar between two dates
dataA <- dataA %>% mutate(startd = date, endd=date+days(3))
dataA <- dataA %>% group_by(date) %>%
filter(date >= startd & date <= endd) %>% mutate(vardiff = last(xvar)-first(xvar))
I've tried a number of different possibilities for this last statement but can't get the calculation I'm looking for. What I'm trying to achieve is the difference in xvar between January 5th and January 2nd and so on for the entire time series. How can this be achieved using dplyr statements?
Thanks!
We can use findInterval and this should also work when there are no exact matches
library(dplyr)
dataA %>%
mutate(vardiff = xvar[findInterval(endd, date)] -
xvar[findInterval(startd, date)])
Or in base R
transform(dataA, vardiff = xvar[findInterval(endd, date)] -
xvar[findInterval(startd, date)])
You can use match to get index of startd and endd to get corresponding xvar and subtract them:
library(dplyr)
dataA %>%
mutate(vardiff = xvar[match(endd, date)] - xvar[match(startd, date)])
This can also be written in base R using transform :
transform(dataA, vardiff = xvar[match(endd, date)] - xvar[match(startd, date)])
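The difference between the two approaches shows up when the end date is not present in the date column. A small sketch with a made-up date vector that has a gap (date, target and the idx_ names are illustrative only):

```r
# 2020-01-04 is deliberately missing from the series.
date <- as.Date("2020-01-02") + c(0, 1, 3, 4)
target <- as.Date("2020-01-04")

idx_match <- match(target, date)            # no exact match, so NA
idx_interval <- findInterval(target, date)  # index of the last date <= target
```

With match the resulting vardiff would be NA for such rows, while findInterval falls back to the closest earlier date (here the second element, 2020-01-03). findInterval needs date to be sorted in increasing order.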
I've got a list of 17 dataframes and a list of 17 dates. They are ordered and correspond to each other. In other words, list_of_dfs[[1]] corresponds to dates[[1]] and so forth. The list of dates, below, are date objects using lubridate::ymd.
> dates
[1] "2004-10-10" "2005-10-10" "2006-10-10" "2007-10-10" "2008-10-10" "2009-10-10" "2010-10-10" "2011-10-10" "2012-10-10" "2013-10-10" "2014-10-10" "2015-10-10" "2016-10-10" "2017-10-10"
[15] "2018-10-10" "2019-10-10" "2020-10-10"
I would like to mutate a subset of variables in each dataframe such that I am subtracting the subset from the corresponding object in dates. For example, I could do the following for the first item.
list_of_dfs[[1]] <- list_of_dfs[[1]] %>%
  mutate_at(.vars = vars(contains('string')),
            .funs = funs(dates[[1]] - .))
Is there a way that I incorporate the above into a map or lapply like command that will allow me to iterate through dates?
My closest approximation would be something like
list_of_dfs <- list_of_dfs %>%
map(., function(x) mutate_at(x,
.vars = vars(contains('string')),
.funs = funs(dates - .)))
which can't take a list object in .funs as shown above.
We can use map2 as we are doing the subtraction from corresponding elements of 'dates' list
library(dplyr)
library(purrr)
list_of_dfs2 <- map2(list_of_dfs, dates, ~ {date <- .y
.x %>%
mutate_at(vars(contains('string')), ~ date - as.Date(.))})
In dplyr 1.0.0 and later, across can be used along with mutate
list_of_dfs2 <- map2(list_of_dfs, dates, ~ { date <- .y
.x %>%
mutate(across(contains('string'), ~ date - as.Date(.x)))
})
data
list_of_dfs <- list(data.frame(string1 = Sys.Date() - 1:6, string2 = Sys.Date()),
data.frame(string1 = Sys.Date() - 1:6, string2 = Sys.Date()))
dates <- Sys.Date() + 1:2
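For comparison, a base R sketch of the same pairwise iteration using Map. The data and the helper subtract_from are made up for illustration, and the result is stored as a plain number of days, which is an assumption about the desired output:

```r
list_of_dfs <- list(
  data.frame(string1 = as.Date("2020-01-01") + 0:1, other = 1:2),
  data.frame(string1 = as.Date("2020-01-05") + 0:1, other = 3:4)
)
dates <- as.Date(c("2020-01-10", "2020-01-20"))

# Hypothetical helper: replace every column whose name contains "string"
# with the number of days between ref_date and that column.
subtract_from <- function(df, ref_date) {
  cols <- grep("string", names(df), value = TRUE)
  df[cols] <- lapply(df[cols], function(col) as.numeric(ref_date - as.Date(col)))
  df
}

# as.list() keeps each element a Date; iterating over the bare
# Date vector can drop the class.
result <- Map(subtract_from, list_of_dfs, as.list(dates))
```

The first data frame is paired with the first date and the second with the second, so result[[1]]$string1 holds 9 and 8 days and result[[2]]$string1 holds 15 and 14.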
I have the following dataframe
Date Time
10/03/2014 12.00.00
11/03/2014 13.00.00
12/03/2014 14.00.00
I want to create one single column as follows
DT
10/03/2014 12.00.00
11/03/2014 13.00.00
12/03/2014 14.00.00
when I run
data$DT <- as.POSIXct(paste(x$Date, x$Time), format="%d-%m-%Y %H:%M:%S")
I get a column DT with all NA values.
data$DT <- as.POSIXct(as.character(paste(data$Date, data$Time)), format="%d/%m/%Y %H.%M.%S")
OR
data$Time <- gsub('\\.',':',data$Time)
data$Date <- gsub('/','-',data$Date)
data$DT <- as.POSIXct(as.character(paste(data$Date, data$Time)), format="%d-%m-%Y %H:%M:%S")
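The NA values in the question most likely come from the format string not matching the separators in the data ("/" and "." versus "-" and ":"). A minimal sketch of both cases, using made-up rows and a fixed tz = "UTC" (the time zone is an assumption, chosen only to make the result reproducible):

```r
data <- data.frame(Date = c("10/03/2014", "11/03/2014"),
                   Time = c("12.00.00", "13.00.00"))
stamp <- paste(data$Date, data$Time)

# Separators in the format do not match the text, so parsing fails.
bad <- as.POSIXct(stamp, format = "%d-%m-%Y %H:%M:%S", tz = "UTC")

# Separators in the format match the text.
good <- as.POSIXct(stamp, format = "%d/%m/%Y %H.%M.%S", tz = "UTC")
```

Here bad is all NA, while good holds 2014-03-10 12:00:00 and 2014-03-11 13:00:00.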
Use the package lubridate:
data$DT <- with(data, dmy(Date) + hms(Time))  # dmy, because the dates are day/month/year
If you want the column to be a POSIXct, do the following after that:
data$DT <- as.POSIXct(data$DT)
This should be a very common problem, so here is a reproducible answer using dplyr:
## reproducible example
library(dplyr)
library(magrittr)
DF <- data.frame(Date = c("10/03/2014", "11/03/2014", "12/03/2014"),
Time = c("12.00.00", "13.00.00", "14.00.00"))
DF_DT <- DF %>%
mutate(DateTime = paste(Date, Time)) %>%
mutate(across('DateTime', ~ as.POSIXct(.x, format = "%d/%m/%Y %H.%M.%S")))