I have a dataset that consists of groups with year, month, and day values. I want to filter the groups using tidyverse in R, such that I locate the latest month in the time series. Here is some example code.
dat = expand.grid(group = seq(1, 5), year = seq(2016, 2020), month = seq(1, 12))
dat = dat[order(dat$group,dat$year,dat$month),]
dat$days=sample(seq(0,30),nrow(dat),replace=TRUE)
dat$year[dat$year == 2020 & dat$month == 12] = NA  # make December 2020 missing
dat = dat[complete.cases(dat),]
In this example, there are 5 groups with monthly data from 2016 - 2020. However, let's suppose December 2020 is missing. Also, some days are missing in the dataset.
I can grab December from 2019, but I'm not sure how to include the days in the summary and filter by the number of days in the month. For example,
a = dat %>%
group_by(group,month) %>%
summarise(year = max(year))
gets the year, but I would also like to keep the correct days value for that month and year. Does anyone know how to keep the days column? I don't want to average it or take a minimum or anything like that.
We can use slice_max to return the full row based on the max value of 'year' for each grouping block:
library(dplyr)
dat %>%
group_by(group, month) %>%
slice_max(year)
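If your dplyr version predates slice_max(), a filter() on the maximum year gives the same rows and likewise keeps the days column; a minimal sketch using the dat built above:
library(dplyr)

dat %>%
  group_by(group, month) %>%
  filter(year == max(year)) %>%  # keeps every column, including days
  ungroup()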
I have the following sample dataframe. The first column is the month and the second column is the number of surveys conducted each month.
month = c(1,2,3,4,5,6,7,8,9,10,11,12)
surveys = c(4,5,3,7,3,4,4,4,6,1,1,7)
df = data.frame(month, surveys)
I want to calculate the average number of surveys from May - August, and then the average number of surveys for the remaining months (Jan - April plus September - December).
How do I do this using the dplyr package?
Assuming the integers represent months, in dplyr, you could use group_by with a boolean TRUE/FALSE and find the mean with summarize:
df %>% group_by(MayAug = month %in% 5:8) %>% summarize(mean = mean(surveys))
#  MayAug  mean
#  <lgl>  <dbl>
#1 FALSE   4.25
#2 TRUE    3.75
I first create a new factor variable period with labels, then group_by period and summarise using mean:
df %>%
mutate(period = factor(between(month, 5,8), labels = c("Other months", "May-Aug"))) %>%
group_by(period) %>%
summarise(mean_surveys = mean(surveys))
# A tibble: 2 × 2
  period       mean_surveys
  <fct>               <dbl>
1 Other months         4.25
2 May-Aug              3.75
First, you need to install the dplyr package if you haven't already:
install.packages("dplyr")
Then you can load the package and use the group_by() and summarize() functions to calculate the averages:
library(dplyr)
df <- data.frame(month, surveys)
may_aug_avg <- df %>%
filter(month >= 5 & month <= 8) %>%
summarize(average = mean(surveys))
remaining_months_avg <- df %>%
filter(!(month >= 5 & month <= 8)) %>%
summarize(average = mean(surveys))
The first pipeline filters the data frame to the months of May through August and then calculates the average number of surveys for those months. The second pipeline excludes May through August and calculates the average for the remaining months. You can then inspect may_aug_avg and remaining_months_avg to get the two averages.
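Since each result is a one-row data frame, you can pull the numbers out directly, for example:
may_aug_avg$average          # 3.75
remaining_months_avg$average # 4.25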
Hope this helps!
I have data like the following:
library(lubridate)
library(dplyr)
library(data.table)
MWE <- data.table(
  Date = rep(seq(ymd("2020-1-1"), ymd("2020-3-30"), by = "days"), each = 6),
  Country = rep(c("France", "United States", "Germany"), 90 * 2),
  TransportType = rep(c("Train", "Cars"), each = 3, times = 90),
  Value = rnorm(90 * 6, 2, 3)
)
I want to create a new variable that is the mean of Value:
by Country and TransportType
by weekday
based on dates before March only (but filled in for the March rows too)
So the mean should be calculated on January and February, but attached to every row in the data, including March.
I have managed to do the first two (or I think so, I am checking):
MWE_2 <- MWE %>%
.[,JourSem:=weekdays(Date)] %>%
.[,Moyenne:=mean(Value),by=.(Country,JourSem,TransportType)]
But I am unsure how to add another condition to that. I think I am getting close with this:
MWE_3 <- MWE %>%
.[,JourSem:=weekdays(Date)] %>%
.[Date <= "2020-02-29",Moyenne:=mean(Value),by=.(Country,JourSem,TransportType)]
But then the March dates have no value, which is logical since they are filtered out, and that is not what I want.
We can first calculate the mean for January and February for each Country, TransportType, and weekday, and then join this data with the March data.
library(data.table)
MWE[, JourSem := weekdays(Date)]
# mean of Value per Country / TransportType / weekday, using pre-March rows only
d1 <- MWE[Date <= as.Date("2020-02-29"),
          .(Moyenne = mean(Value)), by = .(Country, TransportType, JourSem)]
# attach those means to the March rows
MWE[Date > as.Date("2020-02-29")][d1, on = .(Country, TransportType, JourSem)]
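If you prefer to keep every row (January through March) and avoid the join entirely, here is a minimal data.table sketch, assuming the MWE built above:
library(data.table)

MWE[, JourSem := weekdays(Date)]
# mean computed from the pre-March rows only, but written to every row of each group
MWE[, Moyenne := mean(Value[Date <= as.Date("2020-02-29")]),
    by = .(Country, TransportType, JourSem)]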
I have historical monthly data and need to perform a rolling calculation. The price of each period is compared to the price 3 years back, i.e. Current Price / Base Price, where the base is the date 3 years earlier. This rolls forward month by month, so each month is compared with the date 3 years before it. I am using the lag function to find the past price, and it returns NA before Jan-2013, which is correct.
I want to add an additional criterion: if the minimum date of a (Location, Asset, SubType) combination is after 2010, prices should be compared with that minimum date instead. For example, if the minimum date is Jan-2014, all prices after Jan-2014 should be compared with Jan-2014 (a static base date).
You can read data from the code below -
library(readxl)
library(httr)
GET("https://sites.google.com/site/pocketecoworld/Trend_Sale%20-%20Copy.xlsx", write_disk(tf <- tempfile(fileext = ".xlsx")))
dff <- read_excel(tf)
My code:
library(dplyr)

dff <- dff %>%
  group_by(Location, Asset, SubType) %>%
  mutate(BasePrice = lag(Price, 36),
         Index = round(100 * (Price / BasePrice), 1)) %>%
  filter(Period >= '2013-01-31')
Do you mean something like this?
library(dplyr)
dff %>%
group_by(Location, Asset, SubType) %>%
mutate(BasePrice= if(lubridate::year(min(Period)) > 2010)
Price[which.min(Period)] else lag(Price, 36),
Index = round(100*(Price/BasePrice), 1))
If the minimum date in Period is after 2010, we use the Price at the minimum Period as the BasePrice; otherwise we use the Price from 3 years (36 months) earlier.
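As a quick sanity check of that logic, here is a toy group I made up (the names, dates, and prices below are hypothetical, not from the linked file):
library(dplyr)

# hypothetical single group whose history starts in 2014 (i.e. after 2010)
toy <- tibble(
  Location = "A", Asset = "X", SubType = "S",
  Period = seq(as.Date("2014-01-01"), by = "month", length.out = 40),
  Price = 100:139
)

toy %>%
  group_by(Location, Asset, SubType) %>%
  mutate(BasePrice = if (lubridate::year(min(Period)) > 2010)
           Price[which.min(Period)] else lag(Price, 36),
         Index = round(100 * (Price / BasePrice), 1))
# BasePrice is 100 (the Jan-2014 price) for every row of this group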
First of all, I have to describe my data set. It has three columns: column 1 is country, column 2 is date (%Y-%m-%d), and column 3 is a value associated with each row (average hotel room prices). The rows run monthly from 1990 to 2019. It looks like this:
Country Date Value
France 2011-01-01 700
etc.
I'm trying to turn the date into years instead of the full %Y-%m-%d format, so that I can summarise the average values for each country for each year (instead of each month). How would I go about doing that?
I thought about summarising the values manually for each country and each year, but that is hugely tedious and takes a long time (plus the code would look horrible). So I'm wondering if there is a better solution to this problem that I'm not seeing.
Here is the task at hand so far. My dataset priceOnly shows the average price for each month, and I have restricted it to values not equal to 0.
diffyear <- priceOnly %>%
group_by(Country, Date) %>%
summarize(averagePrice = mean(Value[which(Value!=0.0)]))
You can use the lubridate package to extract years and then summarise accordingly.
Something like this:
library(dplyr)
library(lubridate)

diffyear <- priceOnly %>%
  mutate(Year = year(Date)) %>%
  filter(Value > 0) %>%
  group_by(Country, Year) %>%
  summarize(averagePrice = mean(Value, na.rm = TRUE))
And in general, you should always provide a minimal reproducible example with your questions.
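For instance, a tiny made-up priceOnly (the country names and values are invented just to make the pipeline above runnable):
library(dplyr)
library(lubridate)

set.seed(1)
priceOnly <- data.frame(
  Country = rep(c("France", "Spain"), each = 24),
  Date = rep(seq(as.Date("2011-01-01"), by = "month", length.out = 24), 2),
  Value = runif(48, 500, 900)
)

priceOnly %>%
  mutate(Year = year(Date)) %>%
  filter(Value > 0) %>%
  group_by(Country, Year) %>%
  summarize(averagePrice = mean(Value, na.rm = TRUE))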
I have a dataframe of approximately 10 million rows spanning about 570 days. After using strptime to convert the dates and times, the data looks like this:
date X1
1 2004-01-01 07:43:00 1.2587
2 2004-01-01 07:47:52 1.2585
3 2004-01-01 17:46:14 1.2586
4 2004-01-01 17:56:08 1.2585
5 2004-01-01 17:56:15 1.2585
I would like to compute the average value on each day (as in days of the year, not days of the week) and then plot them. E.g. get all rows which have the day "2004-01-01", compute the average price, then do the same for "2004-01-02", and so on.
Similarly I would be interested in finding the average monthly value, or hourly price, but I imagine I can work these out once I know how to get average daily price.
My biggest difficulty here is extracting the day of the year from the date variable automatically. How can I cycle through all 365 days and compute the average value for each day, storing it in a list?
I was able to find the average value for day of the week using the weekdays() function, but I couldn't find anything similar for this.
Here's a solution using dplyr and lubridate. First, round the date down to the nearest day using floor_date (as suggested by thelatemail), then group_by date and calculate the mean value using summarize:
library(dplyr)
library(lubridate)

df %>%
  mutate(date = floor_date(date, unit = "day")) %>%
  group_by(date) %>%
  summarize(mean_X1 = mean(X1))
Using the lubridate package, you can use a similar method to get the average by month, week, or hour. For example, to calculate the average by month:
df %>%
mutate(date = month(date)) %>%
group_by(date) %>%
summarize(mean_X1 = mean(X1))
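Note that month(date) pools, say, January 2004 and January 2005 together. If you instead want one average per calendar month, a small variation (assuming df as above) is to floor the date to the month:
df %>%
  mutate(date = floor_date(date, unit = "month")) %>%
  group_by(date) %>%
  summarize(mean_X1 = mean(X1))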
And by hour:
df %>%
mutate(date = hour(date)) %>%
group_by(date) %>%
summarize(mean_X1 = mean(X1))
Day of year in lubridate is yday, as in
lubridate::yday(Sys.time())
Because the data is large, I recommend a data.table approach:
library(lubridate)
library(data.table)
df$ydate=yday(df$date)
df=data.table(df)
df[,mean(X1),ydate]
If you want to keep different years separate (e.g. 1 Jan 2004 vs 1 Jan 2005):
library(lubridate)
library(data.table)
df$ydate = as.Date(df$date)  # full calendar date, so years stay distinct
df = data.table(df)
df[, mean(X1), ydate]
Note: instead of using strptime to convert the dates, you could just use the ymd_hms function from lubridate.
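For example, applied to the raw character dates (assuming they look like the sample data above):
library(lubridate)
# e.g. "2004-01-01 07:43:00" parsed straight to POSIXct
df$date <- ymd_hms(df$date)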
Just to contribute, here is the solution to do it for multiple columns in your data frame. It uses the same method as George's answer, with a little more added using summarise(across()):
new_df <- df %>%
  mutate(date = hour(date)) %>%
  group_by(date) %>%
  summarise(across(.cols = where(is.numeric), .fns = ~ mean(.x, na.rm = TRUE)))
In this case, .cols specifies that the operation is applied to all columns in numeric format (you can modify it for specific columns). In .fns you put the operation you want to perform (mean, sd, etc.), and you can also pass na.rm.
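For instance, a variation restricted to one named column (just X1 here) instead of every numeric column:
df %>%
  mutate(date = hour(date)) %>%
  group_by(date) %>%
  summarise(across(.cols = X1, .fns = ~ mean(.x, na.rm = TRUE)))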
Greetings!