I am looking at a Stack Overflow post here: R: Count Number of Observations within a group
Here, daily data is created and summed/grouped at monthly intervals (as well as weekly intervals):
library(xts)
library(dplyr)
# create data
date_decision_made <- seq(as.Date("2014/1/1"), as.Date("2016/1/1"), by = "day")
date_decision_made <- format(as.Date(date_decision_made), "%Y/%m/%d")
property_damages_in_dollars <- rnorm(731, 100, 10)
final_data <- data.frame(date_decision_made, property_damages_in_dollars)
# weekly
weekly <- final_data %>%
  mutate(date_decision_made = as.Date(date_decision_made)) %>%
  group_by(week = format(date_decision_made, "%W-%y")) %>%
  summarise(total = sum(property_damages_in_dollars, na.rm = TRUE), Count = n())
# monthly
final_data %>%
  mutate(date_decision_made = as.Date(date_decision_made)) %>%
  group_by(month = format(date_decision_made, "%Y-%m")) %>%
  summarise(total = sum(property_damages_in_dollars, na.rm = TRUE), Count = n())
It seems that the "format" function in R (https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/format) is being used to instruct the computer to "group and sum" the data at some fixed interval.
My question: is there a way to "instruct" the computer to "group and sum" by irregular intervals? E.g. by 11-day periods, by 3-month periods, by 2-year periods? (I guess 3 months could be written as 90 days, and 2 years as 730 days.)
Is this possible?
Thanks
You can use lubridate's ceiling_date/floor_date to create groups at irregular intervals.
library(dplyr)
library(lubridate)
final_data %>%
  mutate(date_decision_made = as.Date(date_decision_made)) %>%
  group_by(group = ceiling_date(date_decision_made, '11 days')) %>%
  summarise(amount = sum(property_damages_in_dollars))
You can also specify intervals like ceiling_date(date_decision_made, '3 years') or ceiling_date(date_decision_made, '2 months').
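For instance, the 3-month case from the question could look like this (a sketch using floor_date, which labels each bucket by its start date rather than its end):
final_data %>%
  mutate(date_decision_made = as.Date(date_decision_made)) %>%
  group_by(group = floor_date(date_decision_made, '3 months')) %>%
  summarise(total = sum(property_damages_in_dollars, na.rm = TRUE), Count = n())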
Using data.table
library(data.table)
library(lubridate)
setDT(final_data)[, .(amount = sum(property_damages_in_dollars)),
                  .(group = ceiling_date(as.IDate(date_decision_made), "11 days"))]
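The same pattern covers the other spans from the question, e.g. two-year buckets:
setDT(final_data)[, .(amount = sum(property_damages_in_dollars)),
                  .(group = ceiling_date(as.IDate(date_decision_made), "2 years"))]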
Related
I have a dataset on a group of individuals that was collected starting at different times for each individual.
I need to subset each individual's data to the 1-year window starting at their first entry, something like this pseudocode: myData[myDate >= "first entry" & myDate <= "first entry" + "1 year"]
Example data:
df_date <- data.frame(
  Name = rep(c("Jim", "Sue"), each = 14),
  Dates = c("2010-1-1", "2010-2-2", "2010-3-5", "2010-4-17", "2010-5-20",
            "2010-6-29", "2010-7-6", "2010-8-9", "2010-9-16", "2010-10-28",
            "2010-11-16", "2010-12-28", "2011-1-16", "2011-2-28",
            "2010-4-1", "2010-5-2", "2010-6-5", "2010-7-17", "2010-8-20",
            "2010-9-29", "2010-10-6", "2010-11-9", "2012-12-16", "2011-1-28",
            "2011-2-28", "2011-3-28", "2011-2-28", "2011-3-28"),
  Event = rep(1, 28)
)
The desired output would keep Jim's data from 2010-1-1 through 2010-12-28 and Sue's from 2010-4-1 through 2011-3-28, and so on. The actual dataset has > 20 individuals, all starting at different times.
Use a combination of tidyverse and lubridate functions:
library(tidyverse)
library(lubridate)
df_date %>%
  mutate(Dates = as_datetime(Dates)) %>%
  group_by(Name) %>%
  arrange(Dates, .by_group = TRUE) %>%
  filter(Dates <= first(Dates) + duration(1, units = "year"))
Similar to Martin C. Arnold's answer, here is another approach based on dplyr and lubridate. min(Dates) + years(1) adds one year to each individual's earliest date.
library(dplyr)
library(lubridate)
df_date2 <- df_date %>%
  mutate(Dates = ymd(Dates)) %>%
  group_by(Name) %>%
  filter(Dates <= min(Dates) + years(1)) %>%
  ungroup()
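One caveat when comparing the two answers: duration(1, units = "year") is a fixed span of seconds (365.25 days in current lubridate versions), while years(1) is a calendar period, so the two cut-offs can disagree by a day around leap years:
# durations count fixed seconds; periods move the calendar date
ymd("2020-01-01") + duration(1, units = "year")  # "2020-12-31 06:00:00 UTC" in current lubridate
ymd("2020-01-01") + years(1)                     # "2021-01-01"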
A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
library(dplyr)
library(tidyr)
library(lubridate)

set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 20),
                  datetime = seq(as.POSIXct("2020-12-26 00:00:00"),
                                 as.POSIXct("2020-12-30 03:00:00"), by = "hour"),
                  land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
dat$ymd <- as.Date(dat$datetime) # make a column for year-month-day
# add land points within each id group
land.pts <- dat %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  drop_na(land) %>%
  mutate(all.land = cumsum(land))
# flag days that have any land points
flag <- land.pts %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  slice(n()) %>%
  mutate(flag = if_else(all.land == 0, 0, 1))
# combine the flagged dataframe with the full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x <- comb %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0 | flag == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days")))
But I still can't quite wrap my head around what to do to make it so that I can figure out when the animal has been on land at least once for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp, then summarise the data down to one row per id and date recording whether the animal was on land at least once during that day.
Then use zoo's rollapply to mark a day as TRUE when it starts a run of three consecutive days on which the animal used land.
library(dplyr)
library(zoo)
dat <- dat %>% mutate(date = as.Date(datetime))

dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  ungroup() %>%
  # if you want all the rows of the data
  left_join(dat, by = c('id', 'date'))
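If you also need just the first point of the first such run per animal (the "flag that first point" part of the question), one possible extension of the same pipeline (a sketch, not part of the original answer) is:
flagged <- dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%          # grouping drops to id here
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  filter(consec_three) %>%                         # NA rows are dropped as well
  slice(1) %>%                                     # first qualifying day per id
  ungroup() %>%
  left_join(dat, by = c('id', 'date')) %>%
  group_by(id) %>%
  slice(1)                                         # first hourly point of that day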
I'm using this script to download data from Google Trends. However, it doesn't return the last 3 days. In other words, I got results up to 28/09/2020, and it is now 01/10/2020.
Is there a way to download even more recent data?
Thank you.
Note: the script is retrieved from here.
library(gtrendsR)
library(tidyverse)
library(lubridate)
get_daily_gtrend <- function(keyword = 'Taylor Swift', geo = 'UA', from = '2013-01-01', to = '2019-08-15') {
  if (ymd(to) >= floor_date(Sys.Date(), 'month')) {
    to <- floor_date(ymd(to), 'month') - days(1)
    if (to < from) {
      stop("Specifying 'to' date in the current month is not allowed")
    }
  }

  # monthly series over the whole range, used to rescale each month's daily data
  mult_m <- gtrends(keyword = keyword, geo = geo, time = paste(from, to))$interest_over_time %>%
    group_by(month = floor_date(date, 'month')) %>%
    summarise(hits = sum(hits)) %>%
    mutate(ym = format(month, '%Y-%m'),
           mult = hits / max(hits)) %>%
    select(month, ym, mult) %>%
    as_tibble()

  # one start/end pair per month
  pm <- tibble(s = seq(ymd(from), ymd(to), by = 'month'),
               e = seq(ymd(from), ymd(to), by = 'month') + months(1) - days(1))

  raw_trends_m <- tibble()

  for (i in seq(1, nrow(pm), 1)) {
    curr <- gtrends(keyword, geo = geo, time = paste(pm$s[i], pm$e[i]))
    print(paste('for', pm$s[i], pm$e[i], 'retrieved', nrow(curr$interest_over_time), 'days of data'))
    raw_trends_m <- rbind(raw_trends_m, curr$interest_over_time)
  }

  trend_m <- raw_trends_m %>%
    select(date, hits) %>%
    mutate(ym = format(date, '%Y-%m')) %>%
    as_tibble()

  trend_res <- trend_m %>%
    left_join(mult_m, by = 'ym') %>%
    mutate(est_hits = hits * mult) %>%
    select(date, est_hits) %>%
    as_tibble() %>%
    mutate(date = as.Date(date))

  return(trend_res)
}
get_daily_gtrend(keyword = 'Taylor Swift', geo = 'UA', from = '2013-01-01', to = '2019-08-15')
This is how Google Trends data works. Even on the website, any request covering more than the last 7 days (up to the last 90 days) returns daily data only up to three days ago, so this is by design.
I'm not certain whether gtrendsR retrieves hourly data, but you can either manually download the last 7 days from the website to get hourly data up to a few hours before your request, or use the PyTrends package, which can return hourly data. Once you have hourly data, you can of course aggregate it easily to daily.
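For what it's worth, a sketch of the 7-day route with gtrendsR itself, assuming your version supports the 'now 7-d' time keyword that the website uses for hourly data (hits may come back as character, e.g. "<1", hence the defensive coercion):
library(gtrendsR)
library(dplyr)

# hourly interest for the last 7 days, aggregated to daily means
hourly <- gtrends(keyword = 'Taylor Swift', geo = 'UA', time = 'now 7-d')$interest_over_time

hourly %>%
  mutate(hits = as.numeric(ifelse(hits == '<1', '0', hits)),  # hits may arrive as character
         day = as.Date(date)) %>%
  group_by(day) %>%
  summarise(est_hits = mean(hits, na.rm = TRUE))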
I have seen many questions on SO about converting daily data to weekly using the xts, zoo, or lubridate packages, but none of the answers fit my problem. I have tried the following code:
library(zoo)
library(lubridate)
library(xts)
library(tidyverse)
# calculation for multiple stations
set.seed(123)
df <- data.frame("date" = seq(from = as.Date("1970-1-1"), to = as.Date("2000-12-31"), by = "day"),
                 "Station1" = runif(length(seq.Date(as.Date("1970-1-1"), as.Date("2000-12-31"), "days")), 10, 30),
                 "Station2" = runif(length(seq.Date(as.Date("1970-1-1"), as.Date("2000-12-31"), "days")), 11, 29),
                 "Station3" = runif(length(seq.Date(as.Date("1970-1-1"), as.Date("2000-12-31"), "days")), 9, 28))
head(df)
# Aggregate over week
df %>%
  mutate(Week = week(ymd(date)),
         Year = year(ymd(date))) %>%
  pivot_longer(-c(Week, date, Year), values_to = "value", names_to = "Station") %>%
  group_by(Year, Week, Station) %>%
  summarise(Weekly = mean(value)) %>%
  arrange(Station) %>%
  print(n = 55)
From the output you can see that 1970 contains 53 weeks, which I don't want. I want the week count to start from the first date of every year, with the 52nd week containing 8 days in a non-leap year; in a leap year, the 9th and 52nd weeks should each contain 8 days, so that every year contains exactly 52 weeks. How can I do that in R?
Why not just write a function that gives the meteorological week from the definition you gave? Package lubridate will give you the day of the year with yday, which can act as the index for a vector of the correct week labels. These are straightforward to construct with simple modular math and concatenation.
You then only need to figure out if you are in a leap year, which again is possible using lubridate::leap_year. Combine these in an ifelse and you have an easy-to-use function:
met_week <- function(dates)
{
  normal_year <- c((0:363 %/% 7 + 1), 52)
  leap_year   <- c(normal_year[1:59], 9, normal_year[60:365])
  year_day    <- lubridate::yday(dates)
  return(ifelse(lubridate::leap_year(dates), leap_year[year_day], normal_year[year_day]))
}
and you can do
df %>% mutate(week = met_week(date))
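and, to reproduce the weekly means per station with the new labels, the same pivot/group/summarise pattern from your question applies:
df %>%
  mutate(Week = met_week(date),
         Year = year(date)) %>%
  pivot_longer(-c(Week, date, Year), values_to = "value", names_to = "Station") %>%
  group_by(Year, Week, Station) %>%
  summarise(Weekly = mean(value)) %>%
  arrange(Station)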
You could also do it manually from the day of the year; I am not sure there is a ready-made function for that.
df %>%
  mutate(Week = pmin(52, ceiling(yday(date) / 7)),
         Year = year(ymd(date)))
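Note that this simpler version puts every leftover day into week 52 (8 days in a normal year, 9 in a leap year) rather than giving the 9th week 8 days in leap years as requested. A quick check:
# count the days per week label; week 52 absorbs the leftover days
df %>%
  mutate(Week = pmin(52, ceiling(yday(date) / 7)),
         Year = year(date)) %>%
  count(Year, Week) %>%
  filter(Week == 52)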
How can I add a column Price.wk.average holding the average price of the week, and a column Price.mo.average holding the average price of the month? Price.wk.average should be the same for the entire week.
Dates      Price  Demand  Price.wk.average  Price.mo.average
2010-1-1   x      x
2010-1-2   x      x
......
2015-1-1   x      x
jkl,
try to post reproducible examples; it will make it easier to help you. You can use dplyr (together with lubridate's week and month helpers):
library(dplyr)
library(lubridate)

df <- data.frame(date = seq(as.Date("2017-1-1"), by = "day", length.out = 100),
                 price = round(runif(100) * 100 + 50, 0))

df <- df %>%
  group_by(week = week(date)) %>%
  mutate(Price.wk.average = mean(price)) %>%
  ungroup() %>%
  group_by(month = month(date)) %>%
  mutate(Price.mo.average = mean(price)) %>%
  ungroup()
(Since I don't have enough points to comment:)
I wanted to point out that Eric's answer will not distinguish average weekly prices by year. If you are interested in unique weeks (week 1 of 2012 != week 1 of 2015), you will need to do extra work to group by unique weeks.
df <- data.frame(Dates = c("2010-1-1", "2010-1-2", "2015-01-3"),
                 Price = c(50, 20, 40))

      Dates Price
1  2010-1-1    50
2  2010-1-2    20
3 2015-01-3    40
Just to keep your data frame tidy, I suggest converting the dates to POSIXct and then sorting the data frame:
library(lubridate)
library(dplyr)

df <- df %>%
  mutate(Dates = lubridate::parse_date_time(Dates, "ymd")) %>%
  arrange(Dates)
To group by unique weeks:
df <- df %>%
  group_by(yw = paste(year(Dates), week(Dates)))
Then mutate and ungroup.
To group by unique months:
df <- df %>%
  group_by(ym = paste(year(Dates), month(Dates)))
and mutate and ungroup.
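Putting those steps together (a sketch that reuses Eric's mean() approach, with the year pasted into each grouping key):
library(dplyr)
library(lubridate)

df <- df %>%
  group_by(yw = paste(year(Dates), week(Dates))) %>%
  mutate(Price.wk.average = mean(Price)) %>%
  ungroup() %>%
  group_by(ym = paste(year(Dates), month(Dates))) %>%
  mutate(Price.mo.average = mean(Price)) %>%
  ungroup()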