Using do(pad(.)) in R and keeping all column names - r

I am attempting to fill in the missing dates in my dataset using the pad function. If I use regular pad such as
data %>% pad(group = GROUP2)
then it works fine and keeps the column values such as brand, device, etc.
However, some of my data occurs at day intervals, some at week intervals and some at month intervals. Therefore, I want to use pad with do so that the time interval is determined individually for each group. When I run the code below, the whole of each inserted row comes back as NA rather than keeping any of the column values like before.
library(dplyr)
library(padr)
padded_data <- data %>%
dplyr::group_by_at(GROUP2) %>%
do(pad(.))
I have tried to research around this but not found anything!

Related

Trying to use ddply to subset a dataframe by two column variables, then find the maximum of a third column in r?

I have a dataframe called data with variables for date, time, temperature, and a group number called Box #. I'm trying to subset the data to find the maximum temperature for each day, for each box, along with the time that temperature occurred at. Ideally I could place this data into a new dataframe with the date, box, maximum temperature and the time it occurred at.
I tried using ddply, but the code only returns one line of output
ddply(data, .('Box #', 'Date'), summarize, max('Temp'))
I was able to find the maximum temperatures for each day using tapply on separate dataframes that only contain the values for individual groups
mx_day_2 <- tapply(box2$Temp, box2$Date, max)
I was unable to apply this to the larger dataframe with all groups and cannot figure out how to also get time from this code.
Is it possible to have ddply subset by both Box # and Date, then return two separate outputs of both maximum temperature and time, or do I need to use a different function here?
Edit: I managed to get the maximum temperatures using a version of the code in the answer below, but still haven't figured out how to find the time at which the max occurs in the same data. The code that worked for the first part was
max_data <- data %>%
group_by(data$'Box #', data$'Date')
max_values <- summarise(max_data, max_temp=max(Temp, na.rm=TRUE))
I would use dplyr/tidyverse instead of plyr; dplyr is the successor to plyr. Also clean the column names with janitor: spaces in column names are difficult to work with (clean_names() changes 'Box #' to box_number).
library(tidyverse)
library(janitor)
mx_day2 <- data %>%
  clean_names() %>%                 # 'Box #' becomes box_number, 'Temp' becomes temp
  group_by(date, box_number) %>%
  summarise(max_temp = max(temp, na.rm = TRUE))
I found a solution that pulls full rows from the initial dataframe into a new dataframe based on only max values. Full code for the solution below
max_data_v2 <- data %>%
group_by(data$'Box #', data$'Date') %>%
filter(Temp == max(Temp, na.rm=TRUE))
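A minimal sketch of the same idea with dplyr's slice_max(), which keeps the whole row holding the maximum so the time comes along with the temperature. The data frame temps and its column names (box, date, time, temp) are made up for illustration and stand in for the cleaned column names of the real data.

```r
library(dplyr)

# Hypothetical toy data: two boxes, two days, several readings per day
temps <- data.frame(
  box  = c(1, 1, 1, 2, 2, 2),
  date = as.Date(c("2020-06-01", "2020-06-01", "2020-06-02",
                   "2020-06-01", "2020-06-01", "2020-06-02")),
  time = c("08:00", "14:00", "09:00", "08:00", "14:00", "09:00"),
  temp = c(15.2, 22.8, 18.1, 14.0, 21.5, 17.3)
)

# One row per box and date: the row with the highest temperature,
# including the time at which it was recorded
daily_max <- temps %>%
  group_by(box, date) %>%
  slice_max(temp, n = 1, with_ties = FALSE) %>%
  ungroup()
```

With with_ties = FALSE only one row per group is returned even when the maximum occurs more than once; drop that argument to keep every tied row, which matches the filter() version above.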

How to separate a time series panel by the number of missing observations at the end?

Consider a set of time series having the same length. Some have missing data at the end, due to the product being out of stock, or due to delisting.
If the series contains at least four missing observations (in my case it is value = 0 and not NA) at the end, I consider the series as delisted.
In my time series panel, I want to separate the series with delisted id's from the other ones and create two different dataframes based on this separation.
I created a simple reprex to illustrate the problem:
library(tidyverse)
library(lubridate)
data <- tibble(id = as.factor(c(rep("1",24),rep("2",24))),
date = rep(c(ymd("2013-01-01")+ months(0:23)),2),
value = c(c(rep(1,17),0,0,0,0,2,2,3), c(rep(9,20),0,0,0,0))
)
I am searching for a pipeable tidyverse solution.
Here is one possibility to find delisted ids
data %>%
group_by(id) %>%
mutate(delisted = all(value[(n()- 3):n()] == 0)) %>%
group_by(delisted) %>%
group_split()
At the end I use group_split to split the data into two parts: one containing the delisted ids and the other containing the non-delisted ids.
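As a variation on the approach above, here is a sketch that produces two named data frames instead of a list. It assumes each series is sorted by date before the last four values are checked; the names panel, flagged, delisted_df and active_df are illustrative only.

```r
library(dplyr)

# Same shape as the reprex: id "1" recovers at the end, id "2" ends in four zeros
panel <- tibble(
  id    = factor(rep(c("1", "2"), each = 24)),
  date  = rep(seq(as.Date("2013-01-01"), by = "month", length.out = 24), 2),
  value = c(rep(1, 17), 0, 0, 0, 0, 2, 2, 3, rep(9, 20), 0, 0, 0, 0)
)

# Flag a series as delisted when its last four observations are all zero
flagged <- panel %>%
  arrange(id, date) %>%
  group_by(id) %>%
  mutate(delisted = all(tail(value, 4) == 0)) %>%
  ungroup()

delisted_df <- filter(flagged, delisted)   # series that end in four zeros
active_df   <- filter(flagged, !delisted)  # all other series
```

Filtering on the flag avoids relying on the order of the list returned by group_split().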

how to average a set of columns and exclude other specific columns in R using the summarise command?

I'm breaking my head here with academic work. I have a data.frame with several numeric columns. I am using the command summarize and group_by in R to perform the average calculations of my data frame.
I tried the code summarise(across(where(is.numeric), mean), -c(Mes, year_date)), but it calculates the average of every numeric column in the data.frame and, in addition, creates a new column named -c(Mes, year_date). I would like some numeric columns to be excluded from the mean calculation but remain in the data.frame.
Note that I tried -c(Mes, year_date) to exclude these two columns from the average calculation, but it didn't work.
I tried
library(tidyr)
library(dplyr)
library(lubridate)
sample_station <-c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','C','C','C','C','C','C','C','C','C','C','A','B','C','A','B','C')
Date_dmy <-c('01/01/2000','08/08/2000','16/03/2001','22/09/2001','01/06/2002','05/01/2002','26/01/2002','16/02/2002','09/03/2002','30/03/2002','20/04/2002','04/01/2000','11/08/2000','19/03/2001','25/09/2001','04/06/2002','08/01/2002','29/01/2002','19/02/2002','12/03/2002','13/09/2001','08/01/2000','15/08/2000','23/03/2001','29/09/2001','08/06/2002','12/01/2002','02/02/2002','23/02/2002','16/03/2002','06/04/2002','01/02/2000','01/02/2000','01/02/2000','02/11/2001','02/11/2001','02/11/2001')
temperature <-c(17,20,24,19,17,19,23,26,19,19,21,15,23,18,22,22,23,18,19,26,21,22,23,27,19,19,21,23,24,25,26,29,30,21,25,24,23)
wind_speed<-c(3.001,6.332,9.321,10.9091,6.38,10.5882,10.5,10.4348,10.3846,10.3448,10.3125,8.35,10.2632,10.2439,10.2273,10.2128,10.2,10.1887,10.1786,12,10.1613,10.1538,10.1471,10.1408,10.1351,10.1299,10.125,2.36,10.1163,10.1124,10.1087,11.2,10.102,10.099,10.0962,10.0935,10.0909)
esp<-c(11.6,11.3,11,10.7,10.4,10.1,9.8,9.5,9.2,8.9,8.6,8.3,8,11.2,10.9,10.6,10.3,10,12.8,12.5,12.2,11.9,11.6,11.3,11,4.36,4.06,3.76,3.46,3.16,2.86,2.56,2.26,1.96,1.66,1.36,23)
volum<-c(300,300,300,300,300,300,300,300,250,250,250,250,250,250,400,400,400,400,400,105,105,105,105,105,105,105,105,105,105,81,81,81,81,81,81,81,81)
df<-data.frame(sample_station, Date_dmy, temperature, wind_speed, esp, volum)%>%
mutate(Date_dmy = dmy(Date_dmy)) %>%
mutate(year_date = floor_date(Date_dmy,'year'))%>%
mutate(Ano=year(Date_dmy))%>%
mutate(Mes=month(Date_dmy))%>%
mutate(Epoca = ifelse(Mes %in% 4:9,'dry','rainy'))%>%
group_by(sample_station, Epoca, Ano)%>%
summarise(across(where(is.numeric), mean), -c(Mes, year_date))
I have several columns that I don't want to be averaged (even if they are numeric), for example the columns esp and volum.
Update: expected output
Because you are summarising only part of the data, you need to specify what data (rows) of the un-summarised data you want to maintain. In your example, you don't want to summarise Mes and year_date, however you have multiple values within each group (sample_station, Epoca, Ano), of these Mes and year_date columns.
Which values of these unsummarised columns do you want to keep?
If you want to keep all values of the unsummarised columns, you may want to include Mes and year_date inside group_by(sample_station, Epoca, Ano) before summarising.
Alternatively, you may use mutate() rather than summarise() to get summary values in a new column for each row of the original dataframe, then choose your rows from there.
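A minimal sketch of the mutate() alternative on a small made-up data frame (toy, station and mean_temperature are illustrative names, not columns from the question): every original row is kept, Mes is left untouched, and the group mean is added as a new column.

```r
library(dplyr)

# Hypothetical toy data: two stations with two readings each
toy <- data.frame(
  station     = c("A", "A", "B", "B"),
  Mes         = c(1, 2, 1, 3),
  temperature = c(10, 20, 30, 50)
)

# mutate() keeps one row per original observation and repeats
# the group mean on every row of the group
toy_means <- toy %>%
  group_by(station) %>%
  mutate(mean_temperature = mean(temperature)) %>%
  ungroup()
```

From here the rows to keep can be chosen explicitly, for instance with distinct() or slice().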
Update:
Again, with your edited post including the desired output, what values do you expect for Mes? For example, when sample_station == 'A', Epoca == 'rainy' and Ano == 2000, you have values for Mes of 1 and 2, and the same year_date. summarise() wants to calculate one single summary value for this group.
You can use across(c(where(is.numeric), -Mes), mean). Note that year_date is not included in the calculation as it is not of class numeric and also because it is included in group_by.
You can also combine multiple mutate statements into one.
If you want to exclude certain columns from the average calculation but keep them in the dataframe, you need to decide which value you want to keep. For example, to keep the first value you can use first.
library(dplyr)
library(lubridate)
data.frame(sample_station, Date_dmy, temperature, wind_speed)%>%
mutate(Date_dmy = dmy(Date_dmy),
year_date = floor_date(Date_dmy,'year'),
Ano=year(Date_dmy),
Mes=month(Date_dmy),
Epoca = ifelse(Mes %in% 4:9,'dry','rainy')) %>%
group_by(sample_station, year_date, Epoca) %>%
summarise(across(c(where(is.numeric), -Mes), mean),
across(Mes, first))

Using summarize (or equivalent?) to create a column of functions in an R dataframe

I'm working with some electricity data that has, for each hour, day and asset a step function which specifies the asset's offering of power at escalating prices. What I'd like to do is collapse those data into a data frame, tibble, etc. with date, time, asset and a row-specific step function. I'll then use that step function to populate some other columns later on.
Here's a quick reproducible example of what I want to do.
library(dplyr)
df_test<-data.frame(rep(1:25, times=1, each=4))
names(df_test)[1]<-"asset"
df_test$block<-rep(1:4, times=25)
df_test$from<-rep(seq(0,150,50), times=25)
df_test$to<-df_test$from+50
df_test$index<-runif(100)*100
df_test<-df_test %>% group_by(asset) %>% mutate(price=cumsum(index))
This is basically an example of what I would have for each hour of each day, except that in my case, the numbers of blocks are different (some firms bid a single block, others bid up to 7 blocks, but that's likely not material to the problem here).
Now, what I would like to do is, for each asset, calculate a step function using
the from, to, and price blocks and store it in a data frame by asset (again, in my extended case, it will be by date, hour, and asset).
For example, using the first group I could do this
generate_func<-function(x,y){
stepfun(x, y, f = as.numeric(0), ties = "ordered",right = FALSE)
}
eg_func<-generate_func(df_test$from[2:4],df_test$price[1:4])
The function eg_func lets me find the implied price at any value x for asset 1.
eg_func(500)
[1] 43.10305
What I'd like to do is group my data by asset and then store a version of eg_func for each asset in a second column of a data frame or equivalent.
Basically, what I want to do is something like:
df_sum<-df_test %>% group_by(asset) %>% summarize(
step_func=generate_func(from[-1],price)
)
But I get:
Error: Column `step_func` is of unsupported type function
Update:
@akrun has gotten me a step down the road. So, if I wrap the function in a list, I can do what I want to do...at least the first step:
df_func<-df_test %>%
group_by(asset) %>%
summarize(step_func=list(generate_func(from[-1],price)))
So now I have a data frame with a step function for each asset. Now, my next quest is to be able to evaluate that function to create a new column evaluating the step function at a particular value. So, for example, I can evaluate the first asset's bid at a value of 50:
df_func[1,2][[1]][[1]](50)
[1] 49.60776
I'd like to be able to do this in a mutate command, so something akin to:
df_func <-df_func %>% mutate(bid_50=step_func[[2]](50))
But that applies the second step function to everyone. How do I fill column bid_50 with each asset's step function evaluated at 50?
Update #2: @akrun again with the solution:
df_func <-df_func %>% mutate(bid_50=map_dbl(step_func, ~ .x(50)))
It is better to wrap it in a list, because the value returned by generate_func is a function. Then loop over the list elements with map_dbl and apply each function to the argument passed, which creates the new column 'bid_50'.
library(tidyverse)
df_test %>%
group_by(asset) %>%
summarize(step_func=list(generate_func(from[-1],price))) %>%
mutate(bid_50 = map_dbl(step_func, ~ .x(50)))
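Because the example data above is random, here is a deterministic sketch of the same list-column pattern whose results can be checked by hand. The bids data frame and its prices are invented for illustration; generate_func is a simplified version of the function from the question.

```r
library(dplyr)
library(purrr)

# Hypothetical bids: two assets, four 50-unit blocks each
bids <- data.frame(
  asset = rep(1:2, each = 4),
  from  = rep(seq(0, 150, 50), times = 2),
  price = c(10, 20, 30, 40, 5, 15, 25, 35)
)

# Step function with knots at the block boundaries;
# f = 0 and right = FALSE make each block closed on the left
generate_func <- function(x, y) {
  stepfun(x, y, f = 0, right = FALSE)
}

bid_curves <- bids %>%
  group_by(asset) %>%
  summarise(step_func = list(generate_func(from[-1], price))) %>%
  mutate(
    bid_50  = map_dbl(step_func, ~ .x(50)),   # price of the block starting at 50
    bid_120 = map_dbl(step_func, ~ .x(120))   # price of the block covering 120
  )
```

Each row of bid_curves holds its own function in step_func, and map_dbl() calls that row's function rather than a single shared one.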

R aggregating irregular time series data by groups (with meta data)

Hi, I have a data frame (~4 million rows) with time series data for different sites and events.
Here is a rough idea of my data, obviously on a different scale. I have several similar time series, so I've kept it general as I want to be able to apply it in different cases.
Data1 <- data.frame(DateTimes =as.POSIXct("1988-04-30 13:20:00")+c(1:10,12:15,20:30,5:13,16:20,22:35)*300,
Site = c(rep("SiteA",25),rep("SiteB",28)),
Quality = rep(25,53),
Value = round(runif(53,0,5),2),
Othermetadata = c(rep("E1",10),rep("E2",15),rep("E1",10),rep("E2",18)))
What I'm looking for is a simple way to group and aggregate this data to different timesteps while keeping metadata which doesn't vary within the group
I have tried using the zoo library and zoo::aggregate ie:
library(zoo)
zooData <- read.zoo(select(Data1, DateTimes, Value))
zooagg <- aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)
However when I do this I'm losing all my metadata and merging different sites data.
I wondered about trying to use plyr or dplyr to split up the data and then applying the aggregate, but I'm still going to lose my other columns.
Is there a better way to do this? I had a brief look at the documentation for the xts library but couldn't see an intuitive solution there either.
Note: as I want this to work for a few different things, both the starting time step and the final time step might change, with the possibility of a random time step or a somewhat regular time step with missing points. The FUN applied may vary (mostly sum or mean), as may the fields I want to split it by.
Edit: I found the solution after Hercules Apergis pushed me in the right direction.
newData <- Data1 %>% group_by(timeagg, Site) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1,newData) %>% select(-DateTimes, - Value) %>% distinct()
The original DateTimes column wasn't a grouping variable - it was the time series, so I added a grouping variable of my aggregated time (here: time to the nearest hour) and summarised on this. Problem was if I joined on this new column I missed any points where there was time during that hour but not on the hour. Thus the inner_join %>% select %>% distinct method.
Now hopefully it works with my real data not eg data!
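The snippet in the edit above uses a timeagg column without showing how it was created. A sketch of the full pipeline, assuming timeagg is the timestamp floored to the hour with the same modulo arithmetic as the zoo call in the question, and using a small made-up Data1:

```r
library(dplyr)

# Hypothetical toy data: irregular readings for two sites over two hours
Data1 <- data.frame(
  DateTimes = as.POSIXct("1988-04-30 13:00:00", tz = "UTC") +
    c(0, 600, 1200, 3600, 4200, 0, 1800, 3600),
  Site    = c(rep("SiteA", 5), rep("SiteB", 3)),
  Quality = 25,
  Value   = c(1, 2, 3, 4, 5, 10, 20, 30)
)

# Assumption: timeagg is the reading time floored to the hour
Data1 <- Data1 %>%
  mutate(timeagg = DateTimes - as.numeric(DateTimes) %% 3600)

# Aggregate per hour and site
newData <- Data1 %>%
  group_by(timeagg, Site) %>%
  summarise(Total = sum(Value), .groups = "drop")

# Re-attach the metadata, then drop the per-reading columns and duplicates
finaldata <- inner_join(Data1, newData, by = c("timeagg", "Site")) %>%
  select(-DateTimes, -Value) %>%
  distinct()
```

Changing 3600 to another number of seconds gives a different target time step, and sum can be swapped for mean in the summarise() call.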
Given the function that you have on aggregation:
aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)
You want to sum the values by group of times AND NOT lose other columns. You can simply do this with the dplyr package:
library(dplyr)
newdata <- Data1 %>% group_by(DateTimes) %>% summarise(sum(Value))
finaldata <- inner_join(Data1, newdata, by = "DateTimes")
newdata is a data.frame in which the Values are summed within each group of DateTimes. inner_join then merges the rows that the two datasets have in common on the DateTimes variable. Since I am not entirely sure what your desired output is, this should be a good starting point.
