Please see attached image of dataset.
What are the different ways to only retain a single value for each 'Month'? I've got a bunch of data points and would only need to retain, say, the mean value.
Many thanks
A different way of using the aggregate() function.
> aggregate(Temp ~ Month, data=airquality, FUN = mean)
Month Temp
1 5 65.54839
2 6 79.10000
3 7 83.90323
4 8 83.96774
5 9 76.90000
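Another base R option, as a minimal sketch: tapply() splits Temp by Month and applies mean, returning a named vector instead of a data frame. It uses the same built-in airquality data; the names monthly_mean and monthly_df are only illustrative.

```r
# Mean temperature per month as a named vector (names are the month numbers)
monthly_mean <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)

# Convert to a data frame if a table shape is needed
monthly_df <- data.frame(Month = as.integer(names(monthly_mean)),
                         Temp = as.numeric(monthly_mean))
monthly_df
```

This keeps one row per month, like the aggregate() call, and may be handy when only the vector of means is needed.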
library(tidyverse)
library(lubridate)
#example data from airquality:
aq <- as_tibble(airquality) # as_data_frame() is deprecated in favour of as_tibble()
aq$mydate<-lubridate::ymd(paste0(2018, "-", aq$Month, "-", aq$Day))
> aq
# A tibble: 153 x 7
Ozone Solar.R Wind Temp Month Day mydate
<int> <int> <dbl> <int> <int> <int> <date>
1 41 190 7.40 67 5 1 2018-05-01
2 36 118 8.00 72 5 2 2018-05-02
3 12 149 12.6 74 5 3 2018-05-03
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE))
summarize() can compute several summary statistics in one call:
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE),
"Num" = n(),
"SD" = sd(Temp, na.rm=TRUE))
# A tibble: 5 x 4
Month Mean_Temp Num SD
<dbl> <dbl> <int> <dbl>
1 5.00 65.5 31 6.85
2 6.00 79.1 30 6.60
3 7.00 83.9 31 4.32
4 8.00 84.0 31 6.59
5 9.00 76.9 30 8.36
A data.table answer:
# load libraries
library(data.table)
library(lubridate)
setDT(dt)
dt[, .(meanValue = mean(value, na.rm =TRUE)), by = .(monthDate = floor_date(dates, "month"))]
Where dt has at least columns value and dates.
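Since dt is not shown in the question, here is a small made-up table with just those two columns to try the call on. The values are invented for illustration; only the column names value and dates come from the answer.

```r
library(data.table)
library(lubridate)

# Hypothetical data: two observations in May, two in June (one missing)
dt <- data.table(
  dates = as.Date(c("2018-05-01", "2018-05-15", "2018-06-03", "2018-06-20")),
  value = c(10, 20, NA, 40)
)

# One row per month; floor_date() maps every date to the first day of its month
monthly <- dt[, .(meanValue = mean(value, na.rm = TRUE)),
              by = .(monthDate = floor_date(dates, "month"))]
monthly
```

Because na.rm = TRUE is set, the June mean is taken over the single non-missing value.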
We can group by the index of the dataset and pass that to aggregate (from base R) to get the mean
aggregate(dat, index(dat), FUN = mean)
NB: Here, we assumed that the dataset is in xts or zoo format. If the dataset has a month column, then use
aggregate(dat, list(dat$Month), FUN = mean)
I have a data set containing climatic data taken hourly from 01-01-2007 to 31-12-2021.
I want to calculate the mean value for a given variable (e.g. temperature) for each day of the year (1:365).
My dataset looks something like this:
dia prec_h tc_h um_h v_d vm_h
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2007-01-01 0.2 22.9 89 42 3
2 2007-01-01 0.4 22.8 93 47 1.9
3 2007-01-01 0 22.7 94 37 1.3
4 2007-01-01 0 22.6 94 38 1.6
5 2007-01-01 0 22.7 95 46 2.3
[...]
131496 2021-12-31 0.0 24.7 87 47 2.6
( "[...]" stands for sequence of data from 2007 - 2014).
I first calculated daily mean temperature for each of my entry dates as follows:
md$dia<-as.Date(md$dia,format = "%d/%m/%Y")
m_tc<-aggregate(tc_h ~ dia, md, mean)
This returned me a data frame with mean temperature values for each analyzed year.
Now, I want to calculate the mean temperature for each day of the year from this data frame, i.e: mean temperature for January 1st up to December 31st.
Thus, I need to end up with a data frame with 365 rows, but I don't know how to do such calculation. Can anyone help me out?
Also, there is a complication: I have 4 leap years in my data frame. Any recommendations on how to deal with them?
Thankfully
First simulate a data set with the relevant columns and number of rows, then aggregate by day giving m_tc.
As for the question, create an auxiliary variable mdia by formatting the dates column as month-day only. Compute the means grouping by mdia. The result is a data.frame with 366 rows and 2 columns as expected (366 rather than 365 because the leap years contribute a 02-29 group).
set.seed(2022)
# number of rows in the question
n <- 131496L
dia <- seq(as.Date("2007-01-01"), as.Date("2021-12-31"), by = "1 day")
md <- data.frame(
dia = sort(sample(dia, n, TRUE)),
tc_h = round(runif(n, 0, 40), 1)
)
m_tc <- aggregate(tc_h ~ dia, md, mean)
mdia <- format(m_tc$dia, "%m-%d")
final <- aggregate(tc_h ~ mdia, m_tc, mean)
str(final)
#> 'data.frame': 366 obs. of 2 variables:
#> $ mdia: chr "01-01" "01-02" "01-03" "01-04" ...
#> $ tc_h: num 20.2 20.4 20.2 19.6 20.7 ...
head(final, n = 10L)
#> mdia tc_h
#> 1 01-01 20.20741
#> 2 01-02 20.44143
#> 3 01-03 20.20979
#> 4 01-04 19.63611
#> 5 01-05 20.69064
#> 6 01-06 18.89658
#> 7 01-07 20.15992
#> 8 01-08 19.53639
#> 9 01-09 19.52999
#> 10 01-10 19.71914
Created on 2022-10-18 with reprex v2.0.2
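On the leap-year part of the question, one possible convention (an assumption, not the only defensible choice) is to drop 29 February before averaging, which leaves exactly 365 groups. A sketch with simulated daily means; the object names no_leap and final365 are illustrative.

```r
set.seed(2022)

# Simulated daily means covering the same period as the question
dia <- seq(as.Date("2007-01-01"), as.Date("2021-12-31"), by = "1 day")
m_tc <- data.frame(dia = dia, tc_h = round(runif(length(dia), 0, 40), 1))

# Month-day key, then remove 29 February (present only in leap years)
m_tc$mdia <- format(m_tc$dia, "%m-%d")
no_leap <- subset(m_tc, mdia != "02-29")

# Mean per calendar day across all years
final365 <- aggregate(tc_h ~ mdia, data = no_leap, FUN = mean)
nrow(final365)
```

Keeping 02-29 as its own row is equally valid; that row would then be an average over the 4 leap years only.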
You can pipe your data with %>% (from the magrittr package) and compute the mean for each date with dplyr:
library(dplyr); library(magrittr)
tcmean<-md %>% group_by(dia) %>% summarise(m_tc=mean(tc_h))
I want to take the average of each column (except the date) after every seven rows. I tried the approach below, but I was getting incorrect values. This method also seems really long. Is there a way to shorten it?
bankamerica = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/bankamerica.csv')
library(tidyverse)
GroupLabels <- 0:(nrow(bankamerica) - 1)%/% 7
bankamerica$Group <- GroupLabels
Avgs <- bankamerica %>%
group_by(bankamerica$Group) %>%
summarize(Avg = mean(bankamerica$tr))
EDITED: Just realized this code provides the incorrect values
I think you're on the right path.
bankamerica %>%
mutate(group = cumsum(row_number() %% 7 == 1)) %>%
group_by(group) %>%
summarise(caldate = first(caldate), across(-caldate, mean)) %>%
select(-group)
## A tibble: 144 × 3
# caldate tr var
# <chr> <dbl> <dbl>
# 1 1/2/01 28.9 -50.6
# 2 1/11/01 23.6 -45.4
# 3 1/23/01 20.9 -45
# 4 2/1/01 17.4 -48
# 5 2/12/01 14.4 -48
# 6 2/21/01 17 -48.9
# 7 3/2/01 19.1 -56
# 8 3/13/01 19.4 -56.9
# 9 3/22/01 23.3 -55.7
#10 4/2/01 7.71 -58.3
This averages every 7 rows, not every 7 days, because there are missing days in the data.
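If 7-day calendar windows are wanted instead, one way is to bin by the number of days elapsed since the first date. The sketch below uses made-up values and assumes caldate is stored as month/day/two-digit-year text, as the printed output suggests; toy, week and weekly are illustrative names.

```r
# Hypothetical toy data in the same shape as bankamerica (values are invented)
toy <- data.frame(
  caldate = c("1/2/01", "1/3/01", "1/4/01", "1/8/01", "1/9/01", "1/11/01", "1/17/01"),
  tr = c(10, 20, 30, 40, 50, 70, 90)
)

# Parse the text dates, then assign each row to a 7-day bin counted from the first date
toy$date <- as.Date(toy$caldate, format = "%m/%d/%y")
toy$week <- as.numeric(toy$date - min(toy$date)) %/% 7

# Mean per 7-day bin; bins with no observations simply do not appear
weekly <- aggregate(tr ~ week, data = toy, FUN = mean)
weekly
```

Here the first bin holds four rows and the last only one, which is the difference from grouping every 7 rows.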
I am trying to calculate the mean temperature per month of daily records between 1988 to 2020 using the following code:
(Temperature_year_month <- (na.omit(database_PE_na) %>% group_by(month) %>% summarise(mean_temp_monthYear = mean(Air.Temp.Mean))))
and I got the following results, that I checked in excel and it seems correct:
# A tibble: 12 x 2
month mean_temp_monthYear
<dbl> <dbl>
1 1 11.4
2 2 13.5
3 3 17.2
4 4 21.2
5 5 26.0
6 6 31.0
7 7 33.3
8 8 32.5
9 9 29.1
10 10 22.4
11 11 15.4
12 12 10.7
However when I do this only for the month of July (month =7). I got a different result:
(Temperature_year_month <- (na.omit(database_PE_na) %>% group_by(month=7) %>% summarise(mean_temp_monthYear = mean(Air.Temp.Mean))))
month mean_temp_monthYear
<dbl> <dbl>
1 7 22.0
Could someone explain to me why this happens?
We can use data.table methods
library(data.table)
setDT(database_PE_na)[month == 7,
.(mean_temp_monthYear = mean(Air.Temp.Mean, na.rm = TRUE))]
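For comparison use == and not =. In group_by(month = 7), the = creates a new month column filled with 7, so every row falls into a single group and the result is the mean over all months.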
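A tiny made-up example may make the difference visible. The data frame toy and its values are hypothetical; only the pattern of the two calls follows the question.

```r
library(dplyr)

# Hypothetical data: two June rows and two July rows
toy <- data.frame(month = c(6, 6, 7, 7), temp = c(10, 20, 30, 40))

# `month = 7` overwrites the column, so all four rows form one group
overwritten <- toy %>%
  group_by(month = 7) %>%
  summarise(mean_temp = mean(temp))

# `month == 7` is a comparison, so only the July rows are kept
filtered <- toy %>%
  filter(month == 7) %>%
  summarise(mean_temp = mean(temp))

overwritten$mean_temp  # mean of all rows
filtered$mean_temp     # mean of the July rows only
```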
If you want the mean of a single month, put the condition in filter instead of group_by.
mean has an na.rm argument, which can be set to TRUE to ignore NA values, instead of using na.omit, which removes the complete row.
Use:
library(dplyr)
Temperature_year_month <- database_PE_na %>%
filter(month==7) %>%
summarise(mean_temp_monthYear = mean(Air.Temp.Mean, na.rm = TRUE))
There is a longitudinal data set in the wide format, from which I want to compute time (in years and days) between the first observation date and the last date an individual was observed. Dates are in the format yyyy-mm-dd. The data set has four observation periods with missing dates, an example is as follows
df1<-data.frame("id"=c(1:4),
"adate"=c("2011-06-18","2011-06-18","2011-04-09","2011-05-20"),
"bdate"=c("2012-06-15","2012-06-15",NA,"2012-05-23"),
"cdate"=c("2013-06-18","2013-06-18","2013-04-09",NA),
"ddate"=c("2014-06-15",NA,"2014-04-11",NA))
Here "adate" is the first date and the last date is the date an individual was last seen. To compute the time difference (lastdate-adate), I have tried using "lubridate" package, for example
lubridate::time_length(difftime(as.Date("2012-05-23"), as.Date("2011-05-20")),"years")
However, I'm challenged by the fact that the last date is not coming from one column. I'm looking for a way to automate the calculation in R. The expected output would look like
id years days
1 1 2.99 1093
2 2 2.00 731
3 3 3.01 1098
4 4 1.01 369
Years is approximated to 2 decimal places.
Another tidyverse solution can be done by converting the data to long format, removing NA dates, and getting the time difference between last and first date for each id.
library(dplyr)
library(tidyr)
library(lubridate)
df1 %>%
pivot_longer(-id) %>%
na.omit %>%
group_by(id) %>%
mutate(value = as.Date(value)) %>%
summarise(years = time_length(difftime(last(value), first(value)),"years"),
days = as.numeric(difftime(last(value), first(value))))
#> # A tibble: 4 x 3
#> id years days
#> <int> <dbl> <dbl>
#> 1 1 2.99 1093
#> 2 2 2.00 731
#> 3 3 3.01 1098
#> 4 4 1.01 369
We could use pmap
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
mutate(out = pmap(.[-1], ~ {
dates <- as.Date(na.omit(c(...)))
tibble(years = lubridate::time_length(difftime(last(dates),
first(dates)), "years"),
days = lubridate::time_length(difftime(last(dates), first(dates)), "days"))
})) %>%
unnest_wider(out)
# A tibble: 4 x 7
# id adate bdate cdate ddate years days
# <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#1 1 2011-06-18 2012-06-15 2013-06-18 2014-06-15 2.99 1093
#2 2 2011-06-18 2012-06-15 2013-06-18 <NA> 2.00 731
#3 3 2011-04-09 <NA> 2013-04-09 2014-04-11 3.01 1098
#4 4 2011-05-20 2012-05-23 <NA> <NA> 1.01 369
Most of the functions introduced here can be quite complex, and they are worth learning if you can. Still, here is a base R approach:
df <- df1; df[-1] <- lapply(df[-1], as.Date) # work on a copy with the date columns as Date
grp <- droplevels(interaction(df[,1],row(df[-1]))) # Create a grouping:
days <- tapply(unlist(df[-1]),grp, function(x)max(x,na.rm = TRUE) - x[1]) #Get the difference
cbind(df[1],days, years = round(days/365,2)) # Create your table
id days years
1.1 1 1093 2.99
2.2 2 731 2.00
3.3 3 1098 3.01
4.4 4 369 1.01
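If you are comfortable with higher-order functions, you could do the following (again assuming the date columns of df1 have already been converted with as.Date):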
dat <- aggregate(adate~id,reshape(df1,list(2:ncol(df1)), dir="long"),function(x)max(x) - x[1])
transform(dat,year = round(adate/365,2))
id adate year
1 1 1093 2.99
2 2 731 2.00
3 3 1098 3.01
4 4 369 1.01
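Using base R apply:
df1[-1] <- lapply(df1[-1], as.Date)
df1[c('years', 'days')] <- t(apply(df1[-1], 1, function(x) {
  # apply() passes each row as character, so drop NAs and convert back to Date
  x <- as.Date(x[!is.na(x)])
  # name the units argument: the third positional argument of difftime() is tz
  x1 <- as.numeric(difftime(x[length(x)], x[1], units = 'days'))
  c(x1/365, x1)
}))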
df1[c('id', 'years', 'days')]
# id years days
#1 1 2.994521 1093
#2 2 2.002740 731
#3 3 3.008219 1098
#4 4 1.010959 369
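A vectorised base R variant, as a sketch: convert the dates to day counts, take the row-wise maximum as the last observation, and subtract the first column. It assumes the dates increase from adate to ddate (so the latest date is also the last one seen) and approximates a year as 365 days; date_num, last_seen and out are illustrative names.

```r
df1 <- data.frame(id = 1:4,
                  adate = c("2011-06-18", "2011-06-18", "2011-04-09", "2011-05-20"),
                  bdate = c("2012-06-15", "2012-06-15", NA, "2012-05-23"),
                  cdate = c("2013-06-18", "2013-06-18", "2013-04-09", NA),
                  ddate = c("2014-06-15", NA, "2014-04-11", NA))

# Numeric matrix of days since 1970-01-01, one column per observation period
date_num <- sapply(df1[-1], function(col) as.numeric(as.Date(col)))

# Latest non-missing date in each row, minus the first observation date
last_seen <- apply(date_num, 1, max, na.rm = TRUE)
days <- last_seen - date_num[, "adate"]

out <- data.frame(id = df1$id, years = round(days / 365, 2), days = days)
out
```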
The solution to this question by #ShirinYavari was almost what I needed, except that it uses a static averaging window width of 2. I have a dataset with random samples from multiple stations, for which I want to calculate a rolling 30-day geomean. All samples within a 30-day window of a given sample should be averaged, so the width may change depending on whether preceding samples are farther apart or closer together in time: you might need to average 2, 3, or more samples if 1, 2, or more preceding samples fall within 30 days of a given sample.
Here is some example data, plus my code attempt:
RESULT = c(50,900,25,25,125,50,25,25,2000,25,25,
25,25,25,25,25,25,325,25,300,475,25)
DATE = as.Date(c("2018-05-23","2018-06-05","2018-06-17",
"2018-08-20","2018-10-05","2016-05-22",
"2016-06-20","2016-07-25","2016-08-11",
"2017-07-21","2017-08-08","2017-09-18",
"2017-10-12","2011-04-19","2011-06-29",
"2011-08-24","2011-10-23","2012-06-28",
"2012-07-16","2012-08-14","2012-09-29",
"2012-10-24"))
FINAL_SITEID = c(rep("A", 5), rep("B", 8), rep("C", 9))
df=data.frame(FINAL_SITEID,DATE,RESULT)
data_roll <- df %>%
group_by(FINAL_SITEID) %>%
arrange(DATE) %>%
mutate(day=DATE-dplyr::lag(DATE, n=1),
day=replace_na(day, 1),
rnk=cumsum(c(TRUE, day > 30))) %>%
group_by(FINAL_SITEID, rnk) %>%
mutate(count=rowid(rnk)) %>%
mutate(GM30=rollapply(RESULT, width=count, geometric.mean, fill=RESULT, align="right"))
I get this error message, which seems like it should be an easy fix, but I can't figure it out:
Error: Column `rnk` must be length 5 (the group size) or one, not 6
The easiest way to compute rolling statistics over date-based windows is the runner package. You don't have to hack around to get 30-day windows: the runner function applies any R function over a rolling window, and idx tells it which column holds the dates. Below is an example of a 30-day geometric.mean within each FINAL_SITEID group:
library(dplyr) # %>%, group_by, arrange, mutate
library(psych) # geometric.mean
library(runner)
df %>%
group_by(FINAL_SITEID) %>%
arrange(DATE) %>%
mutate(GM30 = runner(RESULT, k = 30, idx = DATE, f = geometric.mean))
# FINAL_SITEID DATE RESULT GM30
# <fct> <date> <dbl> <dbl>
# 1 C 2011-04-19 25 25.0
# 2 C 2011-06-29 25 25.0
# 3 C 2011-08-24 25 25.0
# 4 C 2011-10-23 25 25.0
# 5 C 2012-06-28 325 325.
# 6 C 2012-07-16 25 90.1
# 7 C 2012-08-14 300 86.6
# 8 C 2012-09-29 475 475.
# 9 C 2012-10-24 25 109.
# 10 B 2016-05-22 50 50.0
The width argument of rollapply can be a vector of widths which can be set using findInterval. An example of this is shown in the Examples section of the rollapply help file and we use that below.
library(dplyr)
library(psych)
library(zoo)
data_roll <- df %>%
arrange(FINAL_SITEID, DATE) %>%
group_by(FINAL_SITEID) %>%
mutate(GM30 = rollapplyr(RESULT, 1:n() - findInterval(DATE - 30, DATE),
geometric.mean, fill = NA)) %>%
ungroup
giving:
# A tibble: 22 x 4
FINAL_SITEID DATE RESULT GM30
<fct> <date> <dbl> <dbl>
1 A 2018-05-23 50 50.0
2 A 2018-06-05 900 212.
3 A 2018-06-17 25 104.
4 A 2018-08-20 25 25.0
5 A 2018-10-05 125 125.
6 B 2016-05-22 50 50.0
7 B 2016-06-20 25 35.4
8 B 2016-07-25 25 25.0
9 B 2016-08-11 2000 224.
10 B 2017-07-21 25 25.0
# ... with 12 more rows
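To make explicit what the 30-day window means here, a dependency-free sketch for a single site: each sample is averaged with every earlier sample taken fewer than 30 days before it. The helpers geo_mean and roll_gm are hypothetical names written for this illustration, not functions from zoo, psych or runner; the data are the site A rows from the question.

```r
# Geometric mean via logs (assumes strictly positive values)
geo_mean <- function(x) exp(mean(log(x)))

# For each date, average all values whose date lies in (date - days, date]
roll_gm <- function(dates, values, days = 30) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates > dates[i] - days & dates <= dates[i]
    geo_mean(values[in_window])
  }, numeric(1))
}

site_a <- data.frame(
  DATE = as.Date(c("2018-05-23", "2018-06-05", "2018-06-17",
                   "2018-08-20", "2018-10-05")),
  RESULT = c(50, 900, 25, 25, 125)
)
site_a$GM30 <- roll_gm(site_a$DATE, site_a$RESULT)
site_a
```

For several sites the same helper could be applied per group, for example inside group_by(FINAL_SITEID) and mutate.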