I think I have a pretty simple request. I have the following dataframe, where "place" is a unique identifier, while start_date and end_date may overlap. The values are unique for each ID "place".
place start_date end_date value
1 2007-09-01 2010-10-12 0.5
2 2013-09-27 2015-10-11 0.7
What I need is to create a year-based variable, where I expand the time series by year: each first of January (e.g. 2011-01-01) starts a new row for that particular "place" and "value". I mean something like this:
place year value
1 2007 0.5
1 2008 0.5
1 2009 0.5
1 2010 0.5
2 2013 0.7
2 2014 0.7
2 2015 0.7
There are some cases with overlap (i.e. "place"=1 & "year"=2007 appears in two separate observations), where one observation ends in a year and the next observation starts in that same year. In that case I would prefer the "value" of the observation that ends in that year. So if one observation for place=1 ends in March 2007 and another for place=1 starts in April 2007, the year=2007 value for place=1 should be the earlier, "ending" value, if that makes sense.
I've only gotten this far:
data <- data.table(dat)
data[, `:=`(start_date = as.Date(start_date), end_date = as.Date(end_date))]
data[,num_mons:= length(seq(from=start_date, to=end_date, by='year')),by=1:nrow(data)]
I guess writing a loop makes the most sense?
Thank you for your help and advice.

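A tidyverse solution could look like:
library(tidyverse)
data <- tibble(place = c(1, 2),
               start_date = c('2007-09-01', '2013-09-27'),
               end_date = c('2010-10-12', '2015-10-11'),
               value = c(0.5, 0.7))
data %>%
  # build the sequence of years between the start year and the end year
  mutate(year = map2(start_date, end_date,
                     ~ as.character(str_extract(.x, '\\d{4}'):
                                      str_extract(.y, '\\d{4}')))) %>%
  separate_rows(year) %>%
  # drop the leftovers produced by splitting the list column
  filter(!year %in% c('c', '')) %>%
  select(place, year, value)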
# place year value
# <dbl> <chr> <dbl>
# 1 1 2007 0.5
# 2 1 2008 0.5
# 3 1 2009 0.5
# 4 1 2010 0.5
# 5 2 2013 0.7
# 6 2 2014 0.7
# 7 2 2015 0.7
I'm having problems understanding the third paragraph of your question ("There are ..."). It seems to me to be a separate question. If that is the case, please consider moving the question to a separate post here on SO. If it is not a separate question, please reformulate the paragraph.

You could do the following:
library(tidyverse)
library(lubridate)
df %>%
  group_by(place) %>%
  mutate(year = list(seq(year(ymd(start_date)), year(ymd(end_date))))) %>%
  unnest(year) %>%
  select(place, year, value)
# A tibble: 7 x 3
# Groups: place [2]
#   place  year value
#   <int> <int> <dbl>
# 1     1  2007   0.5
# 2     1  2008   0.5
# 3     1  2009   0.5
# 4     1  2010   0.5
# 5     2  2013   0.7
# 6     2  2014   0.7
# 7     2  2015   0.7
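The overlap rule from the question (when two intervals for the same place touch the same year, keep the value of the interval that ends in that year) is not handled by the expansions above. Here is a minimal base R sketch of one way to do it. The extra row for place 1 and its value 0.3 are hypothetical and only there to create an overlap in 2007, and the sketch assumes intervals for one place never overlap by more than the shared boundary year.

```r
# Toy data: the first row is hypothetical and only creates an overlap in 2007.
dat <- data.frame(
  place      = c(1, 1, 2),
  start_date = as.Date(c("2005-02-01", "2007-04-01", "2013-09-27")),
  end_date   = as.Date(c("2007-03-15", "2010-10-12", "2015-10-11")),
  value      = c(0.3, 0.5, 0.7)
)

year_of <- function(d) as.integer(format(d, "%Y"))

# One row per calendar year touched by each interval.
expanded <- do.call(rbind, lapply(seq_len(nrow(dat)), function(i) {
  data.frame(
    place    = dat$place[i],
    year     = seq(year_of(dat$start_date[i]), year_of(dat$end_date[i])),
    value    = dat$value[i],
    end_date = dat$end_date[i]
  )
}))

# Where a place-year appears twice, keep the interval that ends first,
# which (under the assumption above) is the one ending in that year.
expanded <- expanded[order(expanded$place, expanded$year, expanded$end_date), ]
result <- expanded[!duplicated(expanded[c("place", "year")]),
                   c("place", "year", "value")]
rownames(result) <- NULL
result
```

With this toy data, year 2007 for place 1 gets the value 0.3 from the interval that ends in March 2007, and the later years get 0.5.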


summing based on conditions from two dataframes and dealing with dates

I have two dataframes, one with climate data for every location and date across 4 years. The other data frame has a date for each day an animal was trapped at a site. I am trying to calculate the mean of each climate variable based on a specific amount of time before the day the animal was trapped (time length depends on variable in question).
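climate <- data.frame(site=c(1,1,1,1,2,2,2,2,1,1,1,1),
                      precip=c(0.1,0.2,0.1,0.1,0.5,0.2,0.3,0.1,0.2,0.1,0.1,0.5), humid=c(1,1,3,1,2,3,3,1,1,3,1,2),
                      date=c("6/13/2020","6/12/2020","6/11/2020","6/14/2020","6/13/2020","6/12/2020","6/11/2020","6/14/2020","2/13/2019","2/14/2019","2/15/2019","2/16/2019"))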
trap <- data.frame(site=c(1,2,3,3), date=c("7/1/2020","7/1/2020","7/2/2020","7/4/2020"))
> climate
site precip humid date
1 1 0.1 1 6/13/2020
2 1 0.2 1 6/12/2020
3 1 0.1 3 6/11/2020
4 1 0.1 1 6/14/2020
5 2 0.5 2 6/13/2020
6 2 0.2 3 6/12/2020
7 2 0.3 3 6/11/2020
8 2 0.1 1 6/14/2020
9 1 0.2 1 2/13/2019
10 1 0.1 3 2/14/2019
11 1 0.1 1 2/15/2019
12 1 0.5 2 2/16/2019
> trap
site date
1 1 7/1/2020
2 2 7/1/2020
3 3 7/2/2020
4 3 7/4/2020
I want to calculate the mean humid 18-20 days before the date written in the trap dataframe. So essentially what is the mean humid between 6/11/2020 and 6/13/2020 according to the climate data.frame for animals trapped on 7/1/2020. So for site 1 that would be: 1.667 and site 2 that would be 2.67.
I also want to calculate the sum of precipitation 497-500 days before the date written in the trap dataframe. So I would need to calculate the sum (total) precip between 2/13/2019 and 2/16/2019 for an animal trapped on 7/1/2020 at each site. So for site 1 precip would be 0.9.
I know how to create new columns in the trap data frame for mean humid and sum precip, but I'm not sure where to start in terms of coding so that each value is calculated as described above, using the climate rows that correspond to the correct dates, for a large dataset that contains many different trap dates.
Thank you very much, hopefully I am being clear in my description.
I have a solution using functions from the tidyverse. It is always useful to convert date variables to the class date. With this class, you can make calculations. Note, that I renamed the date column in the trap data to trap_date. See comments for more details:
library(tidyverse)
climate <- data.frame(site=c(1,1,1,1,2,2,2,2,1,1,1,1),
                      precip=c(0.1,0.2,0.1,0.1,0.5,0.2,0.3,0.1,0.2,0.1,0.1,0.5), humid=c(1,1,3,1,2,3,3,1,1,3,1,2),
                      date=c("6/13/2020","6/12/2020","6/11/2020","6/14/2020","6/13/2020","6/12/2020","6/11/2020","6/14/2020","2/13/2019","2/14/2019","2/15/2019","2/16/2019"))
trap <- data.frame(site=c(1,2,3,3), trap_date=c("7/1/2020","7/1/2020","7/2/2020","7/4/2020"))
# merge data
data <- merge(climate, trap, by="site")
# parse dates to class 'date'; enables calculations
data <- data %>%
  mutate(date = parse_date(date, format="%m/%d/%Y"),
         trap_date = parse_date(trap_date, format="%m/%d/%Y"))
> head(data)
  site precip humid       date  trap_date
1    1    0.1     1 2020-06-13 2020-07-01
2    1    0.2     1 2020-06-12 2020-07-01
3    1    0.1     3 2020-06-11 2020-07-01
4    1    0.1     1 2020-06-14 2020-07-01
5    1    0.2     1 2019-02-13 2020-07-01
6    1    0.1     3 2019-02-14 2020-07-01
For means:
# humid means
data %>%
group_by(site) %>%
filter(date >= trap_date-20 & date <= trap_date-18) %>%
summarise(mean = mean(humid))
# A tibble: 2 x 2
site mean
<dbl> <dbl>
1 1 1.67
2 2 2.67
However, it seems that the range of 497 to 500 days before the trap date contains no observations. When I used your specified dates, I got the same result of 0.9:
# precip sums
data %>%
group_by(site) %>%
filter(date >= trap_date-500 & date <= trap_date-497)
# A tibble: 0 x 5
# Groups: site [0]
# ... with 5 variables: site <dbl>, precip <dbl>, humid <dbl>,
# date <date>, trap_date <date>
# using your provided dates
data %>%
group_by(site) %>%
filter(date >= as.Date("2019-02-13") & date <= as.Date("2019-02-16")) %>%
summarise(sum = sum(precip))
# A tibble: 1 x 2
site sum
<dbl> <dbl>
1 1 0.9
Hope I can help.
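The empty result for the 497 to 500 day window can be checked with plain date arithmetic. A small base R sketch, using only the trap date and offsets given in the question, shows which calendar dates each window covers and which offsets would be needed to reach 2/13/2019 through 2/16/2019:

```r
trap_date <- as.Date("7/1/2020", format = "%m/%d/%Y")

# Window for the humidity mean: 18 to 20 days before trapping.
humid_window <- c(trap_date - 20, trap_date - 18)
# Window for the precipitation sum: 497 to 500 days before trapping.
precip_window <- c(trap_date - 500, trap_date - 497)

format(humid_window)   # "2020-06-11" "2020-06-13"
format(precip_window)  # "2019-02-17" "2019-02-20"

# Offsets (in days) from the trap date back to 2/13/2019 and 2/16/2019.
offsets <- as.numeric(trap_date - as.Date(c("2019-02-13", "2019-02-16")))
offsets                # 504 501
```

So the precipitation dates in the question correspond to 501 to 504 days before 7/1/2020, which is why filtering on 497 to 500 days returns no rows.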

How do I go about filtering my data by the upper 50th percentile for a separate dependent variable?

I need to split my data so that when I use the facet_wrap I have the top 50 percentile for each year.
Here is a sample of my data:
# A tibble: 10,519 x 3
Species Abundance Year
<chr> <dbl> <chr>
1 Astropecten irregularis 2 2009
2 Asterias rubens 14 2009
3 Echinus esculentus 1 2009
4 Pagurus prideaux 1 2009
5 Raja clavata 1 2009
6 Astropecten irregularis 4 2009
7 Asterias rubens 47 2009
8 Henricia sp. 2 2009
9 Ophiura ophiura 8 2009
10 Solaster endeca 1 2009
# ... with 10,509 more rows
My current strategy is this:
Data <- All_years %>%
group_by(Species, Year) %>%
summarise(Abundance = sum(Abundance, na.rm = TRUE)) %>%
filter(quantile(Abundance, 0.50)<Abundance) %>%
filter(Abundance > 50)
The issue is that this gives me the top 50 percentile for the whole set while I would like it to give me the top 50 for each year so I can then display it with a facet_wrap in ggplot.
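One likely cause: after summarise(), dplyr drops the last grouping variable (Year), so the quantile in filter() is no longer computed within each year. Regrouping by Year before filtering (for example with group_by(Year)) should give a per-year cutoff. The same idea as a minimal base R sketch, with hypothetical species totals standing in for the summarised data:

```r
# Hypothetical totals per species and year (stand-in for the summarised data).
totals <- data.frame(
  Species   = rep(c("A", "B", "C", "D"), times = 2),
  Year      = rep(c("2009", "2010"), each = 4),
  Abundance = c(10, 20, 30, 40, 100, 200, 300, 400)
)

# Median of Abundance within each Year, repeated for every row of that Year.
year_median <- ave(totals$Abundance, totals$Year,
                   FUN = function(x) quantile(x, 0.50, names = FALSE))

# Keep the species that lie above their own year's median.
top_half <- totals[totals$Abundance > year_median, ]
top_half
```

Each year keeps its own upper half (species C and D in both years here), which can then be passed to ggplot with facet_wrap(~ Year).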

Assign day of the year to a month

Sample data
df <- data.frame(ID1 = rep(1:1000, each= 5*365), year = rep(rep(2000:2004, each = 365), times = 1000),
day = rep(1:365, times = 1000*5),
x= runif(365*1000*5))
This data contains a column day which is the day of the year. I need to produce two columns:
Month column: a column of month (which month does the day belong)
Biweek column: which biweek does a day belong to. There are 24 biweek in a year. All days <= 15 in a month is the first biweek and > 15 is second biweek.
For e.g.
1-15 Jan is biweek 1,
16-31 Jan is biweek 2,
1-15 Feb is biweek 3 and
16-28 Feb is biweek 4 and so on.
For sake of simplicity, I am assuming all the years are non-leap years.
Here's the code I have (with help from RS as well) that creates the two columns.
# create a vector of days for each month
months <- list(1:31, 32:59, 60:90, 91:120, 121:151, 152:181, 182:212, 213:243, 244:273, 274:304, 305:334, 335:365)
ptm <- proc.time()
df <- df %>% mutate(month = sapply(day, function(x) which(sapply(months, function(y) x %in% y))), # this assigns each day to a month
date = as.Date(paste0(year,'-',format(strptime(paste0('1981-',day), '%Y-%j'), '%m-%d'))), # this creates a vector of dates for a non-leap year
twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) # this describes which biweek each day falls into
proc.time() - ptm
user system elapsed
121.71 0.31 122.43
My issue is the time it takes to run this script; I am looking for a faster solution.
EDIT: To be clear, I have assumed all years have 365 days. In one of the answers below, for the year 2000 (a leap year), Feb has 29 days (the last day of Feb is day 60, but I want it to be day 59) and therefore Dec has only 30 days (Dec starts with day 336 though it should start with day 335). I hope this is clear. My solution addresses this issue but takes a lot of time to run.
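Because every year is treated as having 365 days, the month and biweek depend only on the day number, so they can be looked up from the month start days without building any dates. A minimal base R sketch of that idea (the variable names are illustrative, not from the question):

```r
# Days per month in a non-leap year and the day of year on which each month starts.
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
month_start <- cumsum(c(1, head(days_in_month, -1)))  # 1, 32, 60, 91, ...

day <- 1:365
month <- findInterval(day, month_start)        # month number, 1 to 12
day_of_month <- day - month_start[month] + 1   # day within the month
biweek <- 2 * month - (day_of_month <= 15)     # biweek number, 1 to 24

# Boundary check around the end of February and the middle of January.
data.frame(day, month, biweek)[c(15, 16, 59, 60), ]
```

On the full data frame the same three vectorised lines can be applied to the day column directly, e.g. inside mutate().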
Here is a solution using lubridate extractors and replacement functions, as mentioned by Frank in a comment. The key ones are yday<-, mday() and month(), which respectively set the day of year of a date, get the day of month of a date, and get the month of a date. 8 sec running time seems pretty acceptable to me, though I'm sure some optimising could shave that down, possibly at some loss of generality.
Note also the use of case_when to ensure the correct numbering of days after Feb 29 on a leap year.
EDIT: Here is a significantly faster solution. You can just get the mapping of DOYs to months and biweeks for a single year, and then left_join to the main table. 0.36s running time, since you no longer have to repetitively create the date. We also bypass having to use case_when, since the join will take care of the missing days. See that Day 59 of year 2000 is February and Day 60 is March, as requested.
library(tidyverse)
library(lubridate)
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>     date
tbl <- tibble(
  ID1 = rep(1:1000, each = 5*365),
  year = rep(rep(2000:2004, each = 365), times = 1000),
  day = rep(1:365, times = 1000*5),
  x = runif(365*1000*5)
)
# lookup table: one row per day of a 365-day (non-leap) year
doys <- tibble(
  day = 1:365,
  date = seq.Date(ymd("2001-1-1"), ymd("2001-12-31"), by = 1),
  month = month(date),
  biweek = case_when(
    mday(date) <= 15 ~ (month * 2) - 1,
    mday(date) > 15 ~ month * 2
  )
)
tbl_out2 <- left_join(tbl, select(doys, -date), by = "day")
#> : 0.36 sec elapsed
tbl_out2
#> # A tibble: 1,825,000 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 1 0.331 1. 1.
#> 2 1 2000 2 0.284 1. 1.
#> 3 1 2000 3 0.627 1. 1.
#> 4 1 2000 4 0.762 1. 1.
#> 5 1 2000 5 0.460 1. 1.
#> 6 1 2000 6 0.500 1. 1.
#> 7 1 2000 7 0.340 1. 1.
#> 8 1 2000 8 0.952 1. 1.
#> 9 1 2000 9 0.663 1. 1.
#> 10 1 2000 10 0.385 1. 1.
#> # ... with 1,824,990 more rows
tbl_out2[55:65, ]
#> # A tibble: 11 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 55 0.127 2. 4.
#> 2 1 2000 56 0.779 2. 4.
#> 3 1 2000 57 0.625 2. 4.
#> 4 1 2000 58 0.245 2. 4.
#> 5 1 2000 59 0.640 2. 4.
#> 6 1 2000 60 0.423 3. 5.
#> 7 1 2000 61 0.439 3. 5.
#> 8 1 2000 62 0.105 3. 5.
#> 9 1 2000 63 0.218 3. 5.
#> 10 1 2000 64 0.668 3. 5.
#> 11 1 2000 65 0.589 3. 5.
Created on 2018-04-06 by the reprex package (v0.2.0).
You can speed this up almost an order of magnitude by defining date first, reducing redundancy in the date call, and then extracting month from date.
ptm <- proc.time()
df <- df %>% mutate(
date = as.Date(paste0(year, "-", day), format = "%Y-%j"), # this creates a vector of dates
month = as.numeric(format(date, "%m")), # extract month
twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) # this describes which biweek each day falls into
proc.time() - ptm
# user system elapsed
# 18.58 0.13 18.75
Versus original version in the question
# user system elapsed
# 117.67 0.15 118.45
Filtered for one year. I think it solves the leap issue you described, unless I'm not clear on what you're saying. Last day of Feb is 59 in the df in my result below, but only because day is 0 indexed.
df2000 <- filter(df, year == "2000")
ptm <- proc.time()
df2000 <- df2000 %>% mutate(
day = day - 1, # dates are 0 indexed
date = as.Date(day, origin = "2000-01-01"),
month = as.numeric(as.POSIXlt(date, format = "%Y-%m-%d")$mon + 1),
bis = month * 2 - (as.numeric(format(date, "%d")) <= 15)
)
proc.time() - ptm
user system elapsed
0.8 0.0 0.8
One year is 0.2 of the whole df, so times reflect that.

Grouping the Data in a data frame based on conditions from more than 1 columns

Problem Description :
I am trying to calculate recency: for each salesman, the most recent value in the Year column where the target-achieved indicator equals 1. If 0 is the only available indicator value for a salesman across all of their years, the minimum year should be chosen instead.
Salesman_ID Year Yearly_Targets_Achieved_Indicator
1 AA-5468 2012 1
2 AA-5468 2013 0
3 AA-5468 2014 0
4 AA-5468 2015 0
5 AA-5468 2016 1
6 AL-3791 2012 1
7 AL-3791 2013 1
8 AL-3791 2014 0
9 AL-3893 2015 0
10 AL-3893 2016 0
Expected Output:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
<chr> <dbl> <dbl>
1 AA-5468 2016 1
2 AL-3791 2013 1
9 AL-3893 2015 0
Using the tidyverse package, I suggest the following code:
Prashant_df <- data.frame(
names(Prashant_df) <- c("Salesman_ID","Year","Yearly_Targets_Achieved_Indicator")
Prashant_df <- Prashant_df %>%
group_by(Salesman_ID) %>%
Yearly_Targets_Achieved_Indicator==1 ~ max(Year),
Yearly_Targets_Achieved_Indicator==0 ~ min(Year)
Prashant_df_collapsed <- Prashant_df %>%
group_by(Salesman_ID) %>%
You can store both maximum and minimum year for each salesman, and the maximum of your binary variable.
newdf = df %>% group_by(Salesman_ID) %>% summarise(
maximum = max(Year),
minimum = min(Year),
maxInd = max(Yearly_Targets_Achieved_Indicator))
From this you can pretty much construct your resulting variable.
Using Base R:
c(by(dat,dat[1],function(x)if(all(x[,3]==0)) x[1,2] else max(x[which(x[,3]==1),2])))
AA-5468 AL-3791 AL-3893
2016 2013 2015
This code is kind of messy but produces the desired output. Here is the explanation:
First group by Salesman_ID. Then, for each group, check whether all the indicators are zero. If yes, return the first year; otherwise, return the latest (maximum) year among the rows whose indicator is 1.
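The rule described above can also be written so that it returns the full rows of the expected output rather than just the years. A minimal base R sketch, with the sample data retyped from the question and a hypothetical helper named pick_row:

```r
# Sample data retyped from the question.
dat <- data.frame(
  Salesman_ID = c(rep("AA-5468", 5), rep("AL-3791", 3), rep("AL-3893", 2)),
  Year = c(2012:2016, 2012:2014, 2015:2016),
  Yearly_Targets_Achieved_Indicator = c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0)
)

# pick_row is a hypothetical helper: it returns one row per salesman.
pick_row <- function(g) {
  hit <- g$Yearly_Targets_Achieved_Indicator == 1
  if (any(hit)) {
    g[hit, ][which.max(g$Year[hit]), ]  # latest year with the target achieved
  } else {
    g[which.min(g$Year), ]              # never achieved: earliest year
  }
}

result <- do.call(rbind, lapply(split(dat, dat$Salesman_ID), pick_row))
rownames(result) <- NULL
result
```

Unlike taking x[1,2] for the all-zero case, which.min(g$Year) does not rely on the rows already being sorted by year.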

Manipulating Dates with dplyr

I have longitudinal, geocoded address data and the length of time at each geocode. I then have a series of variables (I'm just calling them x here) that give characteristics of each geoid location. Below here is just two cases but I have thousands.
id geoid x start end
1 53 0.5 2004-01-01 2004-10-30
1 45 0.7 2004-10-31 2004-12-31
1 45 0.7 2005-01-01 2007-12-31
7 16 0.3 2005-01-01 2007-05-31
7 18 0.4 2007-06-01 2007-08-01
7 42 0.6 2007-08-02 2007-12-31
I need to end up with a single value for each year (2004, 2005, 2006, 2007) and for each case (1, 7) that is weighted by the length of time at each address. So case 1 moves from geoid 53 to 45 in 2004 and case 7 moves from geoid 16 to 18 to 42 in 2007. So I calculate the percent of the year at each geoid (and eventually I will multiply that by x and take the mean for each year to get a weighted average). Cases staying put for a whole year will get a weight of 1.
#calculate the percentage of year at each address for id 1
Time difference of 0.8323066
Time difference of 0.1670089
#calculate the percentage of year at each address for id 7
Time difference of 0.4106776
Time difference of 0.08213552
Time difference of 0.4982888
I can do this by brute force by looking at each year individually, calculating the percent of the year spent at that address. Then I would multiply each weight by the x values and take the mean for that year - that will not be reasonably possible to do with thousands of cases. Any ideas of how to address this more efficiently would be much appreciated. Seems like it might be doable with dplyr slice but I'm stalled out at the moment. The key is separating out each year.
As eipi10 mentioned, some of your data spans more than a year. It also looks inconsistent with the data you used in your time difference calculations, which are all within the same year.
Assuming that your start and end dates would actually be in the same year, you can do something like the following:
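foo <- dat %>%
  mutate(start_year = as.numeric(format(start, "%Y")),
         end_year = as.numeric(format(end, "%Y")),
         same_year = start_year == end_year,
         year_frac = as.numeric(end - start)/365.25,
         wtd_x = year_frac * x)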
This gives you:
id geoid x start end start_year end_year same_year year_frac wtd_x
1 1 53 0.5 2004-01-01 2004-10-31 2004 2004 TRUE 0.83230664 0.41615332
2 1 45 0.7 2004-10-31 2004-12-31 2004 2004 TRUE 0.16700890 0.11690623
3 1 45 0.7 2005-01-01 2007-12-31 2005 2007 FALSE 2.99520876 2.09664613
4 7 16 0.3 2007-01-01 2007-05-31 2007 2007 TRUE 0.41067762 0.12320329
5 7 18 0.4 2007-06-01 2007-07-01 2007 2007 TRUE 0.08213552 0.03285421
6 7 42 0.6 2007-07-02 2007-12-31 2007 2007 TRUE 0.49828884 0.29897331
You can then group and summarise the data using:
bar <- foo %>%
  group_by(start_year, id) %>%
  summarise(sum(wtd_x))
to give you the answer:
start_year id sum(wtd_x)
(dbl) (dbl) (dfft)
1 2004 1 0.5330595 days
2 2005 1 2.0966461 days
3 2007 7 0.4550308 days
Hopefully this will get you started. I wasn't sure how you wanted to deal with cases where the period from start to end spans more than one year or crosses calendar years.
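For periods that cross calendar years, one option is to clip each interval to the year of interest and divide by the length of that year. The following is a minimal base R sketch of that idea, not the method used above: year_fraction is a hypothetical helper, it counts both the start and end day as days spent at the address, and it uses the actual year length (365 or 366) rather than 365.25.

```r
# Fraction of calendar year `yr` covered by each [start, end] interval (vectorised).
year_fraction <- function(start, end, yr) {
  year_begin <- as.numeric(as.Date(paste0(yr, "-01-01")))
  year_end   <- as.numeric(as.Date(paste0(yr, "-12-31")))
  clipped_start <- pmax(as.numeric(start), year_begin)
  clipped_end   <- pmin(as.numeric(end), year_end)
  # Count both endpoints; intervals that miss the year entirely give 0.
  covered <- pmax(clipped_end - clipped_start + 1, 0)
  covered / (year_end - year_begin + 1)
}

# The three rows for id 1 from the question.
start <- as.Date(c("2004-01-01", "2004-10-31", "2005-01-01"))
end   <- as.Date(c("2004-10-30", "2004-12-31", "2007-12-31"))

w2004 <- year_fraction(start, end, 2004)
w2006 <- year_fraction(start, end, 2006)
w2004
w2006
```

For id 1 the two 2004 weights add up to 1, and the long third interval gets a weight of 1 in 2006 and 0 in 2004. Multiplying these weights by x and summing within each id and year would give the weighted value.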
dat %>%
mutate(fractionOfYear = as.numeric(end - start)/365.25)
id geoid x start end fractionOfYear
1 1 53 0.5 2004-01-01 2004-10-30 0.82956879
2 1 45 0.7 2004-10-31 2004-12-31 0.16700890
3 1 45 0.7 2005-01-01 2007-12-31 2.99520876
4 7 16 0.3 2005-01-01 2007-05-31 2.40930869
5 7 18 0.4 2007-06-01 2007-07-01 0.08213552
6 7 42 0.6 2007-07-02 2007-12-31 0.49828884
I was able to find some local help that led us to a simple function. We're still stuck on how to use apply with dates but this overall handles it.
#made up sample address data
dat <- data.frame(id,geoid,start,end)
#format addresses
#function to create proportion of time at each address
prop_time <- function(drange, year){
  start <- drange[[1]]; end <- drange[[2]]
  #start year and end year
  syear <- as.numeric(format(start,'%Y'))
  eyear <- as.numeric(format(end,'%Y'))
  #only intervals that touch the requested year contribute
  if(syear<=year & year<=eyear){
    #first and last day of the requested year (eyear is reused for the last day)
    byear <- as.Date(paste("1/1", sep="/", year), format='%m/%d/%Y')
    eyear <- as.Date(paste("12/31", sep="/", year), format='%m/%d/%Y')
    #clip the interval to the requested year
    astart <- max(byear, start)
    aend <- min(eyear, end)
    prop <- as.numeric((aend - astart))/as.numeric((eyear - byear))
  } else prop <- 0 #if no proportion within same year calculated then gets 0
  return(prop)
}
#a second function to apply prop_time to multiple cases
prop_apply <- function(dat_times, year){
  out <- NULL
  for(i in 1:dim(dat_times)[1]){
    out <- rbind(out,prop_time(dat_times[i,], year))
  }
  return(out)
}
#create new data frame to populate years
dat <- data.frame(dat, y2004=0, y2005=0, y2006=0, y2007=0)
dat_times <- dat[,c("start", "end")]
#run prop_apply in a loop across cases and selected years
for(j in 2004:2007){
  newdate <- paste("y", j, sep="")
  dat[,newdate] <- prop_apply(dat_times, j)
}
