Daily mean of hourly data per parameter - r

I have the following data frame that shows hourly simulated heavy metal concentrations for two parameters:
Date<-c("2013-01-01 02:00:00","2013-01-01 03:00:00","2013-01-02 02:00:00","2013-01-02 03:00:00","2013-01-01 02:00:00","2013-01-01 03:00:00","2013-01-02 02:00:00","2013-01-02 03:00:00")
Parameter<-c("Par1","Par1","Par1","Par1","Par2","Par2","Par2","Par2")
sim1<-c(1,4,3,2,6,5,3,5)
sim2<-c(3,2,3,1,8,2,7,3)
obs<-data.frame(Date,Parameter,sim1,sim2)
obs$Date<-as.POSIXct(obs$Date)
I need the daily mean for each parameter. Any ideas? I tried to use aggregate, but I couldn't figure out how to group by parameter and date.

We can convert the 'Date' to Date class with as.Date, use that in the group_by along with 'Parameter' and get the mean of the rest of the columns with summarise_all
library(tidyverse)
obs %>%
group_by(Daily = as.Date(Date), Parameter) %>%
summarise_all(mean)
Or using aggregate from base R
aggregate(.~ Date + Parameter, transform(obs, Date = as.Date(Date)), mean)
Or using by
by(obs[3:4], list(obs$Parameter, as.Date(obs$Date)), FUN = colMeans)
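One detail worth checking in the as.Date step: for POSIXct input, as.Date converts in UTC by default, so timestamps stored in another time zone can land on a neighbouring calendar day. The following is a minimal base R sketch, assuming the hours should be grouped by local calendar day. The `obs_demo` data and the "Europe/Berlin" zone are hypothetical and not taken from the question.

```r
# Hypothetical hourly values stored in a non-UTC time zone (assumption).
obs_demo <- data.frame(
  Date = as.POSIXct(c("2013-01-01 00:30:00", "2013-01-01 23:30:00",
                      "2013-01-02 00:30:00", "2013-01-02 23:30:00"),
                    tz = "Europe/Berlin"),
  Parameter = "Par1",
  sim1 = c(1, 4, 3, 2)
)

# Default conversion uses UTC, so 00:30 local time falls on the previous day.
day_utc <- as.Date(obs_demo$Date)

# Passing the zone explicitly keeps each timestamp on its local calendar day.
obs_demo$Day <- as.Date(obs_demo$Date, tz = "Europe/Berlin")

# Daily mean per parameter, grouped by the local day.
daily <- aggregate(sim1 ~ Day + Parameter, data = obs_demo, FUN = mean)
print(daily)
```

If the timestamps are already in UTC, or never fall close to midnight, the plain as.Date(Date) call from the answer gives the same grouping.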

Related

Calculating yearly return from daily return data

I have imported daily return data for ADSK via a downloaded Yahoo finance .csv file.
ADSKcsv <- read.csv("ADSK.csv", TRUE)
I have converted the .csv file to a data frame
class(ADSKcsv)
I have selected the two relevant columns that I want to work with and sought to take the mean of all daily returns for each year. I do not know how to do this.
aggregate(Close~Date, ADSK, mean)
The above code yields a mean calculation for each date. My objective is to calculate YoY return from this data, first converting daily returns to yearly returns, then using yearly returns to calculate year-over-year returns. I'd appreciate any help.
May I suggest an easier approach?
library(tidyquant)
ADSK_yearly_returns_tbl <- tq_get("ADSK") %>%
tq_transmute(select = close,
mutate_fun = periodReturn,
period = "yearly")
ADSK_yearly_returns_tbl
If you run the above code, it will download the historical prices for a symbol of interest (ADSK in this case) and then calculate the yearly return. An added bonus to using this workflow is that you can swap out any symbols of interest without manually downloading and reading them in. Plus, it saves you the extra step of calculating the average daily return.
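You can extract the year from the date and then aggregate. read.csv reads the Date column as text, so convert it to Date class first (assuming the dates are in the default YYYY-MM-DD format):
ADSKcsv$Date <- as.Date(ADSKcsv$Date)
This can then be done in base R:
aggregate(Close~year, transform(ADSKcsv, year = format(Date, '%Y')), mean)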
Or with dplyr
library(dplyr)
ADSKcsv %>%
group_by(year = format(Date, '%Y')) %>%
#Or using lubridate's year function
#group_by(year = lubridate::year(Date)) %>%
summarise(Close = mean(Close))
Or data.table
library(data.table)
setDT(ADSKcsv)[, .(Close = mean(Close)), format(Date, '%Y')]
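The snippets above stop at a yearly mean of Close. To get from there to the year-over-year change the question asks about, one option is to divide each year's value by the previous year's. The following is a base R sketch with made-up prices. The `prices_demo` data frame is hypothetical, and treating the yearly mean of Close as the yearly figure is an assumption taken from the question's wording, not a standard definition of return.

```r
# Hypothetical prices standing in for the Yahoo Finance download (assumption).
prices_demo <- data.frame(
  Date  = as.Date(c("2019-01-02", "2019-06-03", "2020-01-02",
                    "2020-06-01", "2021-01-04", "2021-06-01")),
  Close = c(100, 120, 130, 150, 160, 192)
)

# Yearly mean of Close, grouped by the four-digit year.
yearly <- aggregate(Close ~ year,
                    data = transform(prices_demo, year = format(Date, "%Y")),
                    FUN = mean)

# Change relative to the previous year; the first year has no predecessor.
yearly$yoy <- c(NA, yearly$Close[-1] / yearly$Close[-nrow(yearly)] - 1)
print(yearly)
```

For returns based on period-end prices rather than yearly means, the tidyquant periodReturn approach shown earlier is the closer fit.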

R: summarise multiple columns with different summation functions using dplyr results in error?

I am transforming a customer journey dataset from user aggregation level to a day level aggregation. The problem is that I cannot simply sum or mean all columns, as not all variables can be aggregated in the same way. For example, duration is a variable that I want to summarise via mean, while purchase_own is a variable that I want to summarise via sum.
I used dplyr to get this working, but it gives me an error. I tried the following code:
CJd <- CJre %>% group_by(date) %>% summarise_at(vars(purchase_own, purchase_any, CIT,
FIT, T1:T22, devicemobile, devicefixed, purchase_comp, POS_comp, POS_own, POS_any,
markov, first_touch, last_touch, linear_touch), sum)
%>% summarise_at(vars(duration, difference), mean) %>% summarise_at(CountTP, max)
This results in an error:
Error in .f(.x[[i]], ...) : object 'duration' not found
I suspect that this means that summarise_at(vars(duration, difference), mean) is not allowed as second summarise code. Now my question is, how can I write the summarise function so that summation is different for some variables?
Actual results is that only the first summarise_at gets executed, which results in missing variables in my dataset. The missing variables need to be summarised with mean and max, respectively. The expected outcome is these variables grouped by date and summarised by the named functions mean or max are added to the dataset.
The issue is that the first summarise_at doesn't include 'duration', so that column is no longer present in the summarised data when the second summarise_at runs. Instead, we can use mutate_at to replace the first set of columns with their per-date sums, add those columns to the grouping, and then summarise the remaining columns:
CJre %>%
group_by(date) %>%
mutate_at(vars(purchase_own, purchase_any, CIT,
FIT, T1:T22, devicemobile, devicefixed, purchase_comp,
POS_comp, POS_own, POS_any,
markov, first_touch, last_touch, linear_touch), sum) %>%
group_by(purchase_own, purchase_any, CIT,
FIT, T1:T22, devicemobile, devicefixed, purchase_comp,
POS_comp, POS_own, POS_any,
markov, first_touch, last_touch, linear_touch, add = TRUE) %>%
summarise_at(vars(duration, difference), mean)
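With dplyr 1.0.0 or later, the same result can usually be reached in a single summarise call by giving each set of columns its own across(). The following is a sketch under that version assumption. `CJre_demo` is a hypothetical miniature of the data: only the column names come from the question, and the values are made up.

```r
library(dplyr)

# Hypothetical miniature of the journey data (values are invented).
CJre_demo <- data.frame(
  date         = as.Date(c("2019-01-01", "2019-01-01", "2019-01-02")),
  purchase_own = c(1, 0, 1),
  duration     = c(10, 20, 30),
  CountTP      = c(2, 5, 3)
)

# One across() per summary function, all inside the same summarise().
CJd_demo <- CJre_demo %>%
  group_by(date) %>%
  summarise(
    across(c(purchase_own), sum),   # columns to add up
    across(c(duration), mean),      # columns to average
    across(c(CountTP), max)         # columns to take the maximum of
  )
print(CJd_demo)
```

Each across() takes a selection of columns, so the long list from the question (including the T1:T22 range) would go inside the first c(...).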

Dividing values in each cell by the group average in R

I am trying to generate a new column with values derived from the original table. I would like to calculate the group average for the same hotel and same date first, then divide the original sales by these group averages.
Here is my code: I tried to calculate the group average by using group_by and summarise from the dplyr package, but it did not generate my expected results.
hotel = c(rep("Hilton",3), rep("Caesar",3))
date1 = c(rep('2018-01-01',2), '2018-01-02', rep('2018-01-01',3))
dba = c(2,0,1,3,2,1)
sales = c(3,5,7,5,2,3)
df = data.frame(cbind(hotel, date1, dba, sales))
df1 = df %>%
group_by(date1, hotel) %>%
dplyr::summarise(avg = mean(sales)) %>%
acast(., date1~hotel)
Any suggestion would be highly appreciated!
Instead of summarise, we can use mutate. After grouping by 'date1', 'hotel', divide the 'sales' by the mean of 'sales' to create a new column
library(tidyverse)
df %>%
group_by(date1, hotel) %>%
mutate(SalesDividedByMean = sales/mean(sales))
NOTE: cbind returns a matrix, and a matrix can hold only a single type, so when the columns have different types, a single character vector turns the whole data into character. Wrapping the result in data.frame propagates that change: the columns become either factor (with stringsAsFactors = TRUE, the default before R 4.0.0) or character.
data
df <- data.frame(hotel, date1, dba, sales)
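The effect described in the note can be checked directly. The sketch below reuses the vectors from the question, compares the two ways of building the data frame, and then computes the same ratio in base R with ave(), which is offered as an alternative to the grouped mutate rather than as the answer's own method. The names `bad` and `good` are made up for illustration.

```r
hotel <- c(rep("Hilton", 3), rep("Caesar", 3))
date1 <- c(rep("2018-01-01", 2), "2018-01-02", rep("2018-01-01", 3))
sales <- c(3, 5, 7, 5, 2, 3)

# cbind() builds a character matrix first, so sales stops being numeric.
bad <- data.frame(cbind(hotel, date1, sales))
is.numeric(bad$sales)

# Passing the vectors straight to data.frame() keeps each column's type.
good <- data.frame(hotel, date1, sales)
is.numeric(good$sales)

# ave() returns the group mean for every row, so the data keeps its shape.
good$SalesDividedByMean <- good$sales / ave(good$sales, good$date1, good$hotel)
print(good)
```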

Find the Mean of a Value by Year (not the whole date)

I'll preface this by saying I'm very much a self taught beginner with R.
I have a very large data set looking at biological data.
I want to find the average of a variable "shoot.density" split by year, but my date data is entered as "%d/%m/%y". This means using the normal way I would achieve this splits by each individual date, rather than by year only, eg.
tapply(df$Shoot.Density, list(df$Date), mean)
Any help would be much appreciated. I am also happy to paste in a section of my data, but I'm not sure how.
If your data is in date-class, you can use format to transform your date column to a year variable:
tapply(df$Shoot.Density, list(format(df$Date, '%Y')), mean)
If it is stored as text in the format %d/%m/%y, you can extract the two-digit year with the substr function:
tapply(df$Shoot.Density, list(substr(df$Date,7,8)), mean)
You can also do this with dplyr:
library(dplyr)
df %>%
group_by(years = format(Date, '%Y')) %>%
summarise(means = mean(Shoot.Density))
Another way to do this is with the year function of the data.table package:
library(data.table)
setDT(df)[, mean(Shoot.Density), by = year(Date)]
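If the dates are stored as text in %d/%m/%y form, another option is to parse them into Date class with as.Date and a format string. The format-based approaches then apply and yield four-digit years. The following is a base R sketch. `df_demo` and its values are hypothetical, and only the column names follow the question. Note that a two-digit %y is mapped to a century by R's date parser, so very old records may need checking.

```r
# Hypothetical sample in the question's day/month/two-digit-year layout.
df_demo <- data.frame(
  Date          = c("01/03/15", "15/07/15", "02/03/16", "20/08/16"),
  Shoot.Density = c(10, 20, 30, 50)
)

# Parse the text into Date class using the matching format string.
df_demo$Date <- as.Date(df_demo$Date, format = "%d/%m/%y")

# Group by the four-digit year and take the mean.
yearly <- tapply(df_demo$Shoot.Density, format(df_demo$Date, "%Y"), mean)
print(yearly)
```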

How to group data by date by an operation (mean) without affecting the existing dimensions of the data frame in R?

Given the following dataset:
Hours<-c(2,3,4,2,1,1,3)
Project<-c("a","b","b","a","a","b","a")
Period<-c("2014-11-22","2014-11-23","2014-11-24","2014-11-22", "2014-11-23", "2014-11-23", "2014-11-24")
cd=data.frame(Project,Hours,Period)
My goal is to group the hours by mean by date without compromising the data frame structure. See goal:
Hours_goal<-c(2,1.6,3.5,2,1.6,1.6,3.5)
Project_goal<-c("a","b","b","a","a","b","a")
Period_goal<-c("2014-11-22","2014-11-23","2014-11-24","2014-11-22", "2014-11-23", "2014-11-23", "2014-11-24")
cd_goal=data.frame(Project_goal,Hours_goal,Period_goal)
As you can see above, the project and period columns do not change, but the end goal is to contain mean hours by a single day. For example, for 2014-11-23, the original data have values 3,1, and 1. But the mean of these values is 1.6. Therefore, 1.6 has been inserted in place of all these values for this date in this column.
Try
cd$Hours <- with(cd, ave(Hours, Period, FUN = function(x) mean(x, na.rm=TRUE)))
names(cd) <- paste(names(cd), 'goal', sep="_")
Or
library(dplyr)
cd %>%
group_by(Period) %>%
mutate(Hours=mean(Hours, na.rm=TRUE))
Or
library(data.table)
setDT(cd)[, Hours:= mean(Hours, na.rm=TRUE), by=Period]
