I am trying to solve a problem in R. I have 2 data frames which look like this:
df1 <-
Date Rainfall_Duration
6/14/2016 10
6/15/2016 20
6/17/2016 10
8/16/2016 30
8/19/2016 40
df2 <-
Date Removal.Rate
6/17/2016 64.7
6/30/2016 22.63
7/14/2016 18.18
8/19/2016 27.87
I want to look up the dates from df2 in df1 and their corresponding Rainfall_Duration data. For example, I want to look for the 1st date of df2 in df1 and subset rows in df1 for that specific date and 7 days prior to that. additionally, for example: for 6/30/2016 (in df2) there is no dates available in df1 within it's 7 days range. So, in this case I just want to extract the results same as it's previous date (6/17/2016) in df2. Same logic goes for 7/14/2016(df2).
The output should look like this:
df3<-
Rate.Removal.Date Date Rainfall_Duration
6/17/2016 6/14/2016 10
6/17/2016 6/15/2016 20
6/17/2016 6/17/2016 10
6/30/2016 6/14/2016 10
6/30/2016 6/15/2016 20
6/30/2016 6/17/2016 10
7/14/2016 6/14/2016 10
7/14/2016 6/15/2016 20
7/14/2016 6/17/2016 10
8/19/2016 8/16/2016 30
8/19/2016 8/19/2016 40
I could subset data for the 7 days range. But could not do it when no dates are available in that range. I have the following code:
library(plyr)
library (dplyr)
df1$Date <- as.Date(df1$Date,format = "%m/%d/%Y")
df2$Date <- as.Date(df2$Date,format = "%m/%d/%Y")
df3 <- lapply(df2$Date, function(x){
filter(df1, between(Date, x-7, x))
})
names(df3) <- as.character(df2$Date)
bind_rows(df3, .id = "Rate.Removal.Date")
df3 <- ldply (df3, data.frame, .id = "Rate.Removal.Date")
I hope I could explain my problem properly. I would highly appreciate if someone can help me out with this code or a new one. Thanks in advance.
I would approach this by explicitly generating all of the dates on which you want to collect the rainfall duration then binding.
To that end, I split each row and generated a rainfallDate column for each that included the seven previous days. Then, I bound the rows back together and used left_join to get the rainfall data. Finally, I filter out the rows with missing rainfall information
df2 %>%
split(1:nrow(.)) %>%
lapply(function(x){
data.frame(
x
, rainfallDate = seq(x$Date - 7, x$Date, 1)
)
}) %>%
bind_rows() %>%
left_join(df1, by = c(rainfallDate = "Date")) %>%
filter(!is.na(Rainfall_Duration))
gives
Date Removal.Rate rainfallDate Rainfall_Duration
1 2016-06-17 64.70 2016-06-14 10
2 2016-06-17 64.70 2016-06-15 20
3 2016-06-17 64.70 2016-06-17 10
4 2016-08-19 27.87 2016-08-16 30
5 2016-08-19 27.87 2016-08-19 40
Note that if the dates without rainfall information are actually 0's, you could skip the filter line and use replace_na from tidyr to set them to explicit zeros (e.g., if you want average rainfall or to include the removal dates without any days of rainfall ahead of them).
An alternative is possible if you are actually interested in a summary value for each date (e.g., total rain in the past seven days). First, I would generate a rainfall dataset that actually had entries for every day of observation. For that, I used complete from tidyr to add an explicit zero entry for each day then used rollapply from zoo to calculate the sum in the rolling previous seven days.
completeRainfall <-
df1 %>%
complete(Date = full_seq(c(as.Date("2016-06-10"), max(Date)), 1)
, fill = list(Rainfall_Duration = 0)) %>%
mutate(prevSeven = rollapply(Rainfall_Duration, 7, sum
, fill = NA, align = "right"))
Then, a simple join will give you both that day's rainfall and the summarized total from the past seven days. This is particularly useful if your Dates for df1 are ever within seven days of each other to avoid copying the same rows multiple times.
left_join(
df2
, completeRainfall
)
Gives
Date Removal.Rate Rainfall_Duration prevSeven
1 2016-06-17 64.70 10 40
2 2016-06-30 22.63 0 0
3 2016-07-14 18.18 0 0
4 2016-08-19 27.87 40 70
Related
I have two data frames containing row entries with respective dates. Data frame 1 contains observations collected from 2010 to 2017.
dates A
2010-01-01 21
2010-01-02 27
2010-01-03 34
...
2017-12-29 22
2017-12-30 32
2017-12-31 25
Data frame 2 contains observations collected from 2015 to 2020.
dates A
2015-01-01 20
2015-01-02 29
2015-01-03 34
...
2020-12-29 22
2020-12-30 27
2020-12-31 32
Both the data frames have missing observations for some days. I wish to combine both data frames to fill out missing data and obtain complete time series upto 2020 without any repeated entries. Like the following data frame:
dates A
2010-01-01 21
2010-01-02 27
2010-01-03 34
...
2020-12-29 22
2020-12-30 27
2020-12-31 32
Using merge(df1, df2, by = 'dates') or full_join(df1, df2, by = 'dates') creates duplicate entries or two columns A.x and A.y which is not expected.
Try the code below
dfout <- unique(rbind(df1,df2))
dfout <- dfout[order(dfout$dates),]
Combine df1 and df2, if there are duplicate dates which are available in both the dataframes mean the A value and use complete to fill the missing dates.
library(dplyr)
library(tidyr)
df1 %>%
bind_rows(df2) %>%
mutate(dates = as.Date(dates)) %>%
group_by(dates) %>%
summarise(A = mean(A)) %>%
complete(dates = seq(min(date), max(date), by = 'day'))
If your df is really just two columns, you should be able to bind_rows, group_by, and distinct to remove duplicates.
library(dplyr)
df <- bind_rows(df1, df2) %>%
group_by(dates, A) %>%
distinct(dates)
Edit: This will not work if you have data that doesn't agree between the dataframes on a single date. If you have two records for 1/1/15 and they have different A values, those will both be retained.
I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
>
I am trying to produce a rolling "snapshot" of how many tasks were pending a different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, if will create a monthly sequence using seq (or seq.Date) from start to end month (determined from floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence). So, unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day). This is the first through seventh character of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed to reproduce from data below.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
mutate(month = map2(floor_date(start_date, "month"),
floor_date(completed_date, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when start month and completed month are the same, if that can exist), you can subtract 1 month from the sequence of months created. In this case, you can use pmax so that if both start and end months are the same, it will still count the month).
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
seq.Date,
by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
I have a csv file that is written like this
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50
I'd like R to produce something like this
Date Date
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980
1/7/1980 30
Then I would like R to bring the last observation forward like this
Date Date
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980 25
1/7/1980 30
I'd like two separate data.tables created one with just the actual data, then another with the last observation brought forward.
Thanks for all the help!
Edit: I also will need any NA's that are populated to changed to 0
You could also use tidyverse:
library(tidyverse)
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data) %>%
replace(., is.na(.), 0)
First 10 rows:
# A tibble: 104 x 2
Date Data
<date> <dbl>
1 1980-01-01 0
2 1980-01-02 0
3 1980-01-03 0
4 1980-01-04 0
5 1980-01-05 25
6 1980-01-06 25
7 1980-01-07 30
8 1980-01-08 30
9 1980-01-09 30
10 1980-01-10 30
I've used as a starting point the 1st day of the month and year of minimum date, and maximum the maximum date; this can be of course adjusted as needed.
EDIT: #Sotos has an even better suggestion for a more concise approach (by better usage of format argument):
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data)
The solution is:
create a data.frame with successive date
merge it with your original data.frame
use na.locf function from zoo to carry forward your data
Here is the code. I use lubridate to work with date.
library(lubridate)
df$Date <- mdy(df$Date)
successive <-data.frame(Date = seq( as.Date(as.yearmon(df$Date[1])), df$Date[length(df$Date)], by="days"))
successive is the vector of successive dates. Now the merging:
result <- merge(df,successive,all.y = T,on = "Date")
And the forward propagation:
library(zoo)
result$Data <- na.locf(result$Data,na.rm = F)
Date Data
1 1980-01-05 25
2 1980-01-06 25
3 1980-01-07 30
4 1980-01-08 30
5 1980-01-09 30
6 1980-01-10 30
7 1980-01-11 30
8 1980-01-12 30
9 1980-01-13 30
10 1980-01-14 30
11 1980-01-15 30
12 1980-01-16 30
13 1980-01-17 30
14 1980-01-18 30
15 1980-01-19 30
16 1980-01-20 30
17 1980-01-21 30
18 1980-01-22 30
19 1980-01-23 30
20 1980-01-24 30
21 1980-01-25 30
The data:
df <- read.table(text = "Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50", header = T)
Assuming that the result should start at the first of the month of the first date and end at the last date and that the input data frame is DF shown reproducibly in the Note at the end, convert DF to a zoo object z, create a grid of dates g merge them to give zoo objects z0 (with zero filling) and zz (with na.locf filling) and optionally convert back to data frames or else just leave it as is so you can use zoo for further processing.
library(zoo)
z <- read.zoo(DF, header = TRUE, format = "%m/%d/%Y")
g <- seq(as.Date(as.yearmon(start(z))), end(z), "day")
z0 <- merge(z, zoo(, g), fill = 0) # zero filled
zz <- na.locf0(merge(z, zoo(, g))) # na.locf filled
# optional
DF0 <- fortify.zoo(z0) # zero filled
DF2 <- fortify.zoo(zz) # na.locf filled
data.table
The question mentions data tables and if that refers to the data.table package then add:
library(data.table)
DT0 <- data.table(DF0) # zero filled
DT2 <- data.table(DF2) # na.locf filled
Variations
I wasn't clear on whether the question was asking for a zero filled answer and an na.locf filled answer or just an na.locf filled answer whose remaining NA values are 0 filled but assumed the former case. If you want to fill the NAs that are left in the na.locf filled answer then add:
zz[is.na(zz)] <- 0
If you want to end at the end of the last month rather than at the last date replace end(z) with as.Date(as.yearmon(end(z)), frac = 1) .
If you want to start at the first date rather than the first of the month of the first date replace as.Date(as.yearmon(start(z))) with start(z)
.
As an alternative to (3), to start at the first date and end at the last date we could simply convert to ts and back. Note that we need to restore Date class on the second line below since ts class cannot handle Date class directly.
z2.na <- as.zoo(as.ts(z))
time(z2.na) <- as.Date(time(z2.na))
zz20 <- replace(z2.na, is.na(z2.na), 0) # zero filled
zz2 <- na.locf0(z2.na) # na.locf filled
Note
Lines <- "
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50"
DF <- read.table(text = Lines, header = TRUE)
Hi I am new to R and would like to know if there is a simple way to filter data over multiple dates.
I have a data which has dates from 07.03.2003 to 31.12.2016.
I need to split/ filter the data by multiple time series, as per below.
Dates require in new data frame:
07.03.2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e the new data frame should not include dates from 07/03/2005 to 31/12/2012
Let's take the following data.frame with dates:
df <- data.frame( date = c(ymd("2017-02-02"),ymd("2016-02-02"),ymd("2014-02-01"),ymd("2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd and dplyr::between and dplyr::between:
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
I would go with lubridate. In particular
library(data.table)
library(lubridate)
set.seed(555)#in order to be reproducible
N <- 1000#number of pseudonumbers to be generated
date1<-dmy("07-03-2003")
date2<-dmy("06-03-2005")
date3<-dmy("01-01-2013")
date4<-dmy("31-12-2016")
Creating data table with two columns (dates and numbers):
my_dt<-data.table(date_sample=c(sample(seq(date1, date4, by="day"), N),numeric_sample=sample(N,replace = F)))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates<-interval(date2+1,date3-1)#create interval that dates should not fall in.
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1<-dmy("08-03-2003")#should not fall in above range
test_date2<-dmy("08-03-2005")#should fall in above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut:
before
>plot(my_dt)
my_dt<-my_dt[!(date_sample %within% forbidden_dates)]#applying the temporal cut
after
>plot(my_dt)
The end goal is to visualize the amount of a medication taken per day across a large sample of individuals. I'm trying to reshape my data to make a stacked area chart (or something similar).
In a more general term; I have my data structured as below:
id med start_date end_date
1 drug_a 2010-08-24 2011-03-03
2 drug_a 2011-06-07 2011-08-12
3 drug_b 2010-03-26 2010-10-31
4 drug_b 2012-08-14 2013-01-31
5 drug_c 2012-03-01 2012-06-20
5 drug_a 2012-04-01 2012-06-14
I think I'm trying to create a data frame with one row per date, and a column summing the total of patients (id) that are taking that drug on that day. For example, if someone is taking drug_a from 2010-01-01 to 2010-01-20, each of those drug-days should count.
Something like:
date drug_a drug_b drug_c
2010-01-01 5 0 10
2010-01-02 10 2 8
I'm functional with dplyr and tidyr, but unsure how to use spread with dates and durations.
I'd expand out the data to use all dates using a do loop:
library(dplyr)
library(tidyr)
library(zoo)
df %>%
group_by(id, med) %>%
do(with(.,
data_frame(
date = (start_date:end_date) %>% as.Date) ) ) %>%
group_by(date, med) %>%
summarize(frequency = n() ) %>%
spread(med, frequency)