Filling in missing dates and adding "0's" - r

The code below produces the number of avalanches in SLC for each year-month during the ski season (Dec-Mar). Since this code only totals the year-months that appear in the data, it does not include the year-months that had 0 avalanches. How do I fill in my table so that it includes every year-month?
# write the web scraper
library(XML)
library(RCurl)
library(dplyr)
library(zoo)    # for as.yearmon() used below

avalanche <- data.frame()
avalanche.url <- "https://utahavalanchecenter.org/observations?page="
all.pages <- 0:202
for (page in all.pages) {
  this.url <- paste(avalanche.url, page, sep = "")
  this.webpage <- htmlParse(getURL(this.url))
  thispage.avalanche <- readHTMLTable(this.webpage, which = 1, header = TRUE,
                                      stringsAsFactors = FALSE)
  names(thispage.avalanche) <- c("Date", "Region", "Location", "Observer")
  avalanche <- rbind(avalanche, thispage.avalanche)
}
# subset the data to the Salt Lake region
avalancheslc <- subset(avalanche, Region == "Salt Lake")
str(avalancheslc)
# convert the dates and count the avalanches per year-month
avalancheslc <- avalancheslc %>%
  group_by(Date = format(as.yearmon(Date, "%m/%d/%Y"), "%Y-%m")) %>%
  summarise(AvalancheTotal = n())
# keep only Dec-Mar of each year
avalancheslc <- avalancheslc %>%
  filter(as.integer(substr(Date, 6, 7)) %in% c(12, 1:3))
# the data right now looks like this
Date AvalancheTotal
1980-01 1
1981-02 1
.
.
.
# the data needs to look like this
Date AvalancheTotal
1980-01 1
1980-02 0
1980-03 0
1980-12 0
1981-01 0
1981-02 1
1981-03 1

library("tidyverse")
library("lubridate")
# You data here...
# Simpler version
avalancheslc %>%
separate(Date, c("year", "month")) %>%
# Some years might be missing (no avalanches at all)
# We can fill in those with `full_seq` but
# `full_seq` works with numbers not characters
mutate(year = as.integer(year)) %>%
complete(year = full_seq(year, 1), month,
fill = list(AvalancheTotal = 0)) %>%
unite("Date", year, month, sep = "-")
# Alternative version (fills in all months, so needs filtering afterwards)
avalancheslc <- avalancheslc %>%
  # In case `Date` needs parsing
  mutate(Date = parse_date_time(Date, "ym"))
# A full data frame of months
all_months <- avalancheslc %>%
  expand(Date = seq(first(Date), last(Date), by = "month"))
# Join to the avalanche counts and fill in with 0s
avalancheslc %>%
  right_join(all_months, by = "Date") %>%
  replace_na(list(AvalancheTotal = 0))
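As the comment above notes, the alternative version fills in every calendar month, so the Dec-Mar filter has to be reapplied afterwards. A minimal sketch of that last step, assuming Date is the date-time produced by parse_date_time() above (avalancheslc_filled is just an illustrative name for the result):
avalancheslc_filled <- avalancheslc %>%
  right_join(all_months, by = "Date") %>%
  replace_na(list(AvalancheTotal = 0)) %>%
  # keep only the ski-season months (Dec-Mar)
  filter(month(Date) %in% c(12, 1:3)) %>%
  # back to the "YYYY-MM" display format
  mutate(Date = format(Date, "%Y-%m"))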

Related

Finding the first row after which x rows meet some criterion in R

A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 10),
                  datetime = seq(as.POSIXct("2020-12-26 00:00:00"),
                                 as.POSIXct("2020-12-30 03:00:00"), by = "hour"),
                  land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once a day for 3 straight days. I tried something like this:
library(dplyr)
library(tidyr)  # for drop_na()

dat$ymd <- as.Date(dat$datetime)  # make a column for year-month-day
# add up land points within each id/day group
land.pts <- dat %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  drop_na(land) %>%
  mutate(all.land = cumsum(land))
# flag days that have any land points
flag <- land.pts %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  slice(n()) %>%
  mutate(flag = if_else(all.land == 0, 0, 1))
# combine the flagged dataframe with the full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x <- comb %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0 | flag == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days")))
But I still can't quite work out how to determine when the animal has been on land at least once a day for three straight days, and then flag that first point on land. Thanks so much for any help!
Create a date column from the timestamp, then summarise the data so there is only one row per id and date showing whether the animal was on land at least once that day.
Use zoo's rollapply function to mark a day as TRUE if the animal was on land on that day and on each of the next two days (i.e. three consecutive days).
library(dplyr)
library(zoo)

dat <- dat %>% mutate(date = as.Date(datetime))

dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%
  # after summarise() the result is grouped by id, so rollapply runs within each id
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  ungroup() %>%
  # if you want all the rows of the original data
  left_join(dat, by = c('id', 'date'))
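The rollapply column only marks the candidate days; to actually flag the first such day per id (the original ask), a small follow-up along these lines should work. This is just a sketch: land_runs and first_land_day are names introduced here for illustration, not part of the answer above.
land_runs <- dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  ungroup()

# first date per id on which a three-day run of land use begins
first_flag <- land_runs %>%
  filter(consec_three) %>%   # filter() also drops the NA padding at the end of each id
  group_by(id) %>%
  summarise(first_land_day = min(date))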

Dplyr grouped percentages in different timeframes

I have data in the following format:
DATE       GROUP  EVENT  ELIGIBLE
2021-3-9   A      1      1
2021-3-1   A      0      0
2021-3-1   B      0      1
2021-2-20  B      1      1
I would like to group the data by the GROUP column and then add three new columns that calculate, per group, the percentage of eligible rows with an event (sum of EVENT over sum of ELIGIBLE) for the following time frames: the last 3 months, 3 to 6 months back, and the last year.
I have calculated the overall percentage without separate timeframes by doing the following:
grouped <- data %>%
  filter(ELIGIBLE == 1) %>%
  group_by(GROUP) %>%
  mutate(count_Eligible = sum(ELIGIBLE == 1),
         count_events = sum(EVENT == 1),
         Percentage = round(100 * count_events / count_Eligible, 2))
I am wondering what the cleanest way would be to add the three different percentages within the timeframes. So far I have pulled the dates to do the filtering with the following code:
today <- Sys.Date()
three_month_lookback <- as.Date(today) - months(3)
six_month_lookback <- as.Date(today) - months(6)
one_year_lookback <- as.Date(today) - months(12)
We can create a function to do the calculation
library(dplyr)
library(purrr)
library(lubridate)  # for today()

f1 <- function(data) {
  data %>%
    filter(ELIGIBLE == 1) %>%
    group_by(GROUP) %>%
    transmute(count_Eligible = sum(ELIGIBLE == 1),
              count_events = sum(EVENT == 1),
              Percentage = round(100 * count_events / count_Eligible, 2))
}
Then, loop over the 'lookback' periods, subset the data based on the 'DATE' column and apply the function
map2_dfr(list(three_month_lookback, six_month_lookback, one_year_lookback),
         list(today(), three_month_lookback, today()),
         ~ data %>%
             mutate(DATE = as.Date(DATE)) %>%
             filter(DATE >= .x, DATE <= .y) %>%
             f1(.),
         .id = 'grp')
If we need to combine by columns
map2(list(three_month_lookback, six_month_lookback, one_year_lookback),
     list(today(), three_month_lookback, today()),
     ~ data %>%
         mutate(DATE = as.Date(DATE)) %>%
         filter(DATE >= .x, DATE <= .y) %>%
         f1(.)) %>%
  reduce(full_join, by = "GROUP")
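If only the three percentage columns are needed, a single summarise with conditional sums is another option. This is just a sketch, not part of the answer above: pct_in_window and the pct_* column names are made up for illustration.
# helper that computes the percentage inside a date window
pct_in_window <- function(event, eligible, date, from, to) {
  in_win <- date >= from & date <= to & eligible == 1
  round(100 * sum(event[in_win]) / sum(eligible[in_win]), 2)
}

data %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(GROUP) %>%
  summarise(
    pct_last_3m  = pct_in_window(EVENT, ELIGIBLE, DATE, three_month_lookback, today()),
    pct_3m_to_6m = pct_in_window(EVENT, ELIGIBLE, DATE, six_month_lookback, three_month_lookback),
    pct_last_1y  = pct_in_window(EVENT, ELIGIBLE, DATE, one_year_lookback, today())
  )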

Get the non-intersecting/complementary part of several date intervals

I want to get the missing (complementary) parts of several date intervals in 2017.
For example, for each "id" of the following data frame:
df <- data.frame(id=c(rep("a",3),rep("b",2)),
start=c("2017-01-01","2017-01-10","2017-02-10","2017-03-01","2017-04-20"),
end=c("2017-01-15","2017-01-20","2017-02-20","2017-03-28","2017-04-29"))
id start end
a 2017-01-01 2017-01-15
a 2017-01-10 2017-01-20
a 2017-02-10 2017-02-20
b 2017-03-01 2017-03-28
b 2017-04-20 2017-04-29
I want to get:
df_final <- data.frame(id = c(rep("a", 2), rep("b", 3)),
                       start = c("2017-01-21", "2017-02-21", "2017-01-01", "2017-03-29", "2017-04-30"),
                       end = c("2017-02-09", "2017-12-31", "2017-02-28", "2017-04-19", "2017-12-31"))
id start end
a 2017-01-21 2017-02-09
a 2017-02-21 2017-12-31
b 2017-01-01 2017-02-28
b 2017-03-29 2017-04-19
b 2017-04-30 2017-12-31
Thank you!
First, confirm whether start and end are Date class.
df$start <- as.Date(df$start)
df$end <- as.Date(df$end)
Use by() to split the data into a list of two data frames according to the ids.
library(purrr)

by(df, df$id, function(x){
  year <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), 1)
  ind <- map2(x$start, x$end, function(start, end){
    which(year < start | year > end)
  }) %>% reduce(intersect)
  gap <- which(diff(ind) > 1)
  head <- ind[c(1, gap + 1)]
  tail <- ind[c(gap, length(ind))]
  return(data.frame(id = unique(x$id), start = year[head], end = year[tail]))
}) %>% reduce(rbind)
Description:
year : all days in 2017.
ind : the dates between each row's start and end are dropped; what remains are the indices of the missing dates.
gap : the positions where those indices are discontinuous.
Output:
# id start end
# 1 a 2017-01-21 2017-02-09
# 2 a 2017-02-21 2017-12-31
# 3 b 2017-01-01 2017-02-28
# 4 b 2017-03-29 2017-04-19
# 5 b 2017-04-30 2017-12-31
I think my solution is still a bit cumbersome, but I hope it helps.
I encountered a similar problem recently, and I found that expanding the table to get one row for each relevant date, and then collapsing back down to ranges, was easier than trying to work out the correct logic from the range endpoints alone.
Here's how that approach would work. Alternatively, it might be possible to do something like this or this, but those approaches don't have the "not in range" issue you're dealing with.
library(dplyr)
library(fuzzyjoin)
library(lubridate)

df <- data.frame(id = c(rep("a", 3), rep("b", 2)),
                 start = c("2017-01-01", "2017-01-10", "2017-02-10", "2017-03-01", "2017-04-20"),
                 end = c("2017-01-15", "2017-01-20", "2017-02-20", "2017-03-28", "2017-04-29"))

# All the dates in 2017.
all.2017.dates = data.frame(date = seq.Date(as.Date("2017-01-01"), as.Date("2017-12-31"), by = "day"))

# Start by expanding the original dataframe so that we get one record for each
# id for each date in any of that id's ranges.
df.expanded = df %>%
  # Convert the strings to real dates.
  mutate(start.date = as.Date(start),
         end.date = as.Date(end)) %>%
  # Left join to 2017 dates on dates that are in the range of this record.
  fuzzy_left_join(all.2017.dates,
                  by = c("start.date" = "date", "end.date" = "date"),
                  match_fun = list(`<=`, `>=`)) %>%
  # Filter to distinct ids/dates.
  select(id, date) %>%
  distinct()

# Now, do an anti-join that gets dates NOT in an id's ranges, and collapse back
# down to ranges.
df.final = expand.grid(id = unique(df$id),
                       date = all.2017.dates$date) %>%
  # Anti-join on id and date.
  anti_join(df.expanded,
            by = c("id", "date")) %>%
  # Sort by id, then date, so that the lead/lag functions behave as expected.
  arrange(id, date) %>%
  # Check whether this record is an endpoint (i.e., is it adjacent to the
  # previous/next record?).
  mutate(prev.day.included = coalesce(date == lag(date) + 1 &
                                        id == lag(id), F),
         next.day.included = coalesce(date == lead(date) - 1 &
                                        id == lead(id), F)) %>%
  # Filter to just endpoint records.
  filter(!prev.day.included | !next.day.included) %>%
  # Fill in both start and end dates on "start" records. The start date is the
  # date in the record; the end date is the date of the next record.
  mutate(start.date = as.Date(ifelse(!prev.day.included, date, NA),
                              origin = lubridate::origin),
         end.date = as.Date(ifelse(!prev.day.included, lead(date), NA),
                            origin = lubridate::origin)) %>%
  filter(!is.na(start.date))
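The chain above still carries the helper columns; to match the shape of df_final from the question, it could be finished with a select along these lines (a small addition, not part of the original answer):
# keep only the id and the reconstructed range endpoints
df.final %>%
  select(id, start = start.date, end = end.date)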
Here's my solution:
library(tidyverse)
library(lubridate)
library(wrapr)

df %>%
  mutate_at(2:3, ymd) %>%
  group_by(id) %>%
  gather('start_end', 'date', start:end) %>%
  mutate(date = if_else(start_end == 'start', min(date), max(date))) %>%
  unique() %>%
  mutate(
    start = if_else(
      start_end == 'start',
      date %>% min() %>% year() %>% paste0('-01-01') %>% ymd(),
      date
    ),
    end = if_else(
      start_end == 'end',
      date %>% max() %>% year() %>% paste0('-12-31') %>% ymd(),
      date
    )) %>%
  filter(start != end) %>%
  select(id, start, end) %>%
  mutate(supp = TRUE) %>%
  bind_rows(mutate(df, supp = FALSE) %>% mutate_at(2:3, ymd)) %>%
  arrange(id, start) %>%
  mutate(rn = row_number()) %.>%
  left_join(., mutate(., rn = rn - 1), by = c('id', 'rn')) %>%
  na.omit() %>%
  mutate(
    start = case_when(
      (start.y >= end.x) & !supp.x ~ end.x + 1,
      (start.y >= end.x) & supp.x ~ start.x,
      TRUE ~ as.Date(NA)
    ),
    end = case_when(
      (start.y >= end.x) & supp.y ~ end.y,
      (start.y >= end.x) & !supp.y ~ start.y - 1,
      TRUE ~ as.Date(NA)
    )
  ) %>%
  select(id, start, end) %>%
  na.omit()

Losing the order of a date range when formatting the data in a table

Here is some random data:
library(dplyr)
library(tidyr)
library(lubridate)  # for %m+% and months()

channels <- c("Facebook", "Youtube", "SEM", "Organic", "Direct", "Email")
# assumed: last_month was not defined in the post; any month-start date works, e.g.
last_month <- floor_date(Sys.Date(), "month")
mts <- seq(from = last_month %m+% months(-23), to = last_month, by = "1 month") %>% as.Date()
dimvars <- expand.grid(Month = mts, Channel = channels)
rws <- nrow(dimvars)
Sessions <- round(rnorm(rws, 5000), 0)
dataset <- cbind(dimvars, Sessions)
And here is a table using this random data:
sessionsTable <- dataset %>%
  mutate(Month = format(Month, "%b-%Y")) %>%
  gather(Key, Value, -Channel, -Month) %>%
  spread(Month, Value) %>%
  select(-Key) %>%
  mutate_at(vars(-Channel), funs(. %>% round(0) %>% scales::comma()))
And here's what it looks like:
> sessionsTable
Channel Apr-2016 Apr-2017 Aug-2015 Aug-2016 Dec-2015 Dec-2016 Feb-2016 Feb-2017 Jan-2016 Jan-2017 Jul-2015 Jul-2016 Jun-2015 Jun-2016
1 Facebook 14,852 7,604 8,841 16,429 20,720 21,527 10,960 13,824 14,394 19,163 1,047 19,209 2,198 18,655
2 Youtube 11,566 12,452 3,410 15,940 67 3,197 3,068 20,082 1,408 14,042 7,599 17,469 834 17,779
3 SEM 23,606 28,385 14,868 22,406 20,225 15,713 20,064 20,589 27,807 15,978 13,601 24,072 14,018 26,584
4 Organic 12,796 40,916 14,882 22,058 13,388 33,316 12,723 37,252 12,694 36,058 18,322 16,574 21,868 14,915
5 Direct 36,057 19,386 55,336 27,534 46,702 24,804 40,207 26,838 44,965 26,310 56,514 30,411 48,851 25,816
6 Email 15,966 4,768 7,663 6,051 13,520 17,650 9,100 13,939 13,909 10,430 10,116 16,317 14,854 18,430
Mar-2016 Mar-2017 May-2016 May-2017 Nov-2015 Nov-2016 Oct-2015 Oct-2016 Sep-2015 Sep-2016
1 8,732 12,734 16,651 3,959 18,415 18,020 9,840 27,853 9,193 25,364
2 7,458 14,466 15,009 10,118 2,065 9,524 8,334 10,564 2,643 14,037
3 22,987 30,840 24,686 16,839 21,354 13,472 14,511 11,954 14,725 17,313
4 14,649 34,112 16,018 37,914 12,482 28,349 18,586 22,605 20,009 24,234
5 39,981 23,635 27,717 14,265 47,936 22,207 46,651 22,485 50,203 30,574
6 8,743 12,628 18,224 2,806 12,343 8,104 15,602 4,405 6,383 11,708
The problem is that the columns are not in chronological order but in alphabetical month order: April 2016, then April 2017. What I want is April 2016, then May 2016, then June 2016, and so on.
If I leave out the line mutate(Month = format(Month, "%b-%Y")) %>% the sorting goes the way I want it, but then my Month column shows a full date, e.g. 2015-06-01 and 2015-07-01.
How can I have a chronologically ordered table in my preferred month format, e.g. "Apr-2016", "May-2016", "Jun-2016", etc.?
If you want a vector to have pretty labels and ordering, you can use a factor. Specifically, you make an ordered factor whose values are the formatted months and whose levels are the unique formatted months in the right order.
sessionsTable <- dataset %>%
  mutate(Month = ordered(
    format(Month, "%b-%Y"),
    levels = format(sort(unique(Month)), "%b-%Y")
  )) %>%
  gather(Key, Value, -Channel, -Month) %>%
  spread(Month, Value) %>%
  select(-Key) %>%
  mutate_at(vars(-Channel), funs(. %>% round(0) %>% scales::comma()))
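If forcats (part of the tidyverse) is available, a similar result can be had with fct_inorder, provided the rows are arranged chronologically first. This is an alternative sketch rather than the answer's own code; sessionsTable2 is just an illustrative name.
sessionsTable2 <- dataset %>%
  arrange(Month) %>%  # ensure first appearance of each formatted month is in chronological order
  mutate(Month = forcats::fct_inorder(format(Month, "%b-%Y"))) %>%
  gather(Key, Value, -Channel, -Month) %>%
  spread(Month, Value) %>%   # spread() orders the new columns by the factor levels
  select(-Key) %>%
  mutate_at(vars(-Channel), funs(. %>% round(0) %>% scales::comma()))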

summarize weekly average using daily data in R

How do I add a column price.wk.average such that it equals the average price of that week, and a column price.mo.average such that it equals the average price of that month? price.wk.average should be the same for every row in a given week.
Dates      Price  Demand  Price.wk.average  Price.mo.average
2010-1-1   x      x
2010-1-2   x      x
......
2015-1-1   x      x
jkl,
Try to post reproducible examples; it will make it easier to help you. You can use dplyr:
library(dplyr)
library(lubridate)  # for week() and month()

df <- data.frame(date = seq(as.Date("2017-1-1"), by = "day", length.out = 100),
                 price = round(runif(100) * 100 + 50, 0))

df <- df %>%
  group_by(week = week(date)) %>%
  mutate(Price.wk.average = mean(price)) %>%
  ungroup() %>%
  group_by(month = month(date)) %>%
  mutate(Price.mo.average = mean(price))
(Since I don't have enough points to comment:) I wanted to point out that Eric's answer will not distinguish average weekly price by year. So if you are interested in unique weeks (Week 1 of 2012 != Week 1 of 2015), you will need to do extra work to group by unique weeks.
df <- data.frame(Dates = c("2010-1-1", "2010-1-2", "2015-01-3"),
                 Price = c(50, 20, 40))

      Dates Price
1  2010-1-1    50
2  2010-1-2    20
3 2015-01-3    40
Just to keep your data frame tidy, I suggest converting dates to POSIX format then sorting the data frame:
library(lubridate)

df <- df %>%
  mutate(Dates = lubridate::parse_date_time(Dates, "ymd")) %>%
  arrange(Dates)
To group by unique weeks:
df <- df %>%
  group_by(yw = paste(year(Dates), week(Dates)))
Then mutate and ungroup.
To group by unique months:
df <- df %>%
  group_by(ym = paste(year(Dates), month(Dates)))
and mutate and ungroup.
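Putting the "mutate and ungroup" steps together, the year-aware version might look like the sketch below; Price.wk.average and Price.mo.average simply mirror the column names from the question.
df <- df %>%
  group_by(yw = paste(year(Dates), week(Dates))) %>%   # unique year-week
  mutate(Price.wk.average = mean(Price)) %>%
  ungroup() %>%
  group_by(ym = paste(year(Dates), month(Dates))) %>%  # unique year-month
  mutate(Price.mo.average = mean(Price)) %>%
  ungroup()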
