Lag function usage within a dplyr subset - r

My basic goal is to subset a data set and summarise it with new columns that use the lag function. I understand how to subset the data set, but I am struggling to apply the lag function within it.
I have already tried a few different ways of implementing it, without success.
gapminder %>%
  na.omit() %>%
  group_by(country) %>%
  summarise(prevPeriod = lag(year),
            lifeExpGrowth = lag(lifeExp),
            popGrowth = lag(pop),
            gdppcGrowth = 100*(gdpPercap/lag(gdpPercap) - 1))
My code currently runs the lag based on the country, not the year. The gdppcGrowth column is also supposed to return a percent, but I am getting an error:
Column `gdppcGrowth` must be length 1 (a summary value), not 12
For each of the functions, I want to analyze the data by country focusing on growth rates. I want to use the lag(x) function to access the previous value of a series or vector so that 100*(x/lag(x) - 1) computes standard (arithmetic) growth rates of x expressed as a percent.
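The error arises because summarise() expects one value per group, while lag() returns a vector as long as the group. A minimal sketch of one way around this, assuming mutate() is acceptable in place of summarise(): sort each country by year and compute the lagged columns row by row. The toy data frame below is a hypothetical stand-in for gapminder with made-up values.

```r
library(dplyr)

# Hypothetical stand-in for gapminder: two countries, three periods each
toy <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  year      = rep(c(2000, 2005, 2010), times = 2),
  gdpPercap = c(100, 110, 121, 200, 100, 150),
  stringsAsFactors = FALSE
)

growth <- toy %>%
  group_by(country) %>%
  # order rows by year inside each country so lag() means "previous period"
  arrange(year, .by_group = TRUE) %>%
  mutate(
    prevPeriod  = lag(year),
    gdppcGrowth = 100 * (gdpPercap / lag(gdpPercap) - 1)
  ) %>%
  ungroup()
```

The first row of each country has no previous period, so its lagged columns are NA.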

Related

Trying to use ddply to subset a dataframe by two column variables, then find the maximum of a third column in r?

I have a dataframe called data with variables for date, time, temperature, and a group number called Box #. I'm trying to subset the data to find the maximum temperature for each day, for each box, along with the time at which that temperature occurred. Ideally I could place this into a new dataframe with the date, box, maximum temperature and the time it occurred at.
I tried using ddply, but the code only returns one line of output
ddply(data, .('Box #', 'Date'), summarize, max('Temp'))
I was able to find the maximum temperatures for each day using tapply on separate dataframes that only contain the values for individual groups
mx_day_2 <- tapply(box2$Temp, box2$Date, max)
I was unable to apply this to the larger dataframe with all groups and cannot figure out how to also get time from this code.
Is it possible to have ddply subset by both Box # and Date, then return two separate outputs of both maximum temperature and time, or do I need to use a different function here?
Edit: I managed to get the maximum temperatures using a version of the code in the answer below, but still haven't figured out how to find the time at which each maximum occurs. The code that worked for the first part was
max_data <- data %>%
  group_by(`Box #`, Date)
max_values <- summarise(max_data, max_temp = max(Temp, na.rm = TRUE))
I would use dplyr/tidyverse instead of plyr; dplyr is the successor to plyr. I would also clean the column names with janitor, because spaces and symbols in names are difficult to work with (clean_names() changes 'Box #' to box_number).
library(tidyverse)
library(janitor)
mx_day2 <- data %>%
  clean_names() %>%
  group_by(date, box_number) %>%
  summarise(max_temp = max(temp, na.rm = TRUE))
I found a solution that pulls full rows from the initial dataframe into a new dataframe based on only max values. Full code for the solution below
max_data_v2 <- data %>%
  group_by(`Box #`, Date) %>%
  filter(Temp == max(Temp, na.rm = TRUE))
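If a one-row-per-group summary is preferred over pulling full rows, a possible sketch is to index the time column with which.max() inside summarise(). The data frame and its column names (box_number, date, time, temp) are hypothetical and assume names already cleaned in the clean_names() style.

```r
library(dplyr)

# Hypothetical readings; names assume clean_names()-style columns
readings <- data.frame(
  box_number = c(1, 1, 1, 2, 2, 2),
  date = c("2020-01-01", "2020-01-01", "2020-01-02",
           "2020-01-01", "2020-01-01", "2020-01-02"),
  time = c("08:00", "14:00", "09:00", "08:00", "14:00", "10:00"),
  temp = c(10, 15, 12, 20, NA, 18),
  stringsAsFactors = FALSE
)

daily_max <- readings %>%
  group_by(box_number, date) %>%
  summarise(
    max_temp    = max(temp, na.rm = TRUE),
    # which.max() skips NA and returns the position of the first maximum
    time_of_max = time[which.max(temp)],
    .groups = "drop"
  )
```

Unlike the filter() approach, this keeps only the first time when a maximum is tied, and it assumes every group has at least one non-missing temperature.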

group_by and summarize usage in tidyverse package in r

I am analyzing the COVID-19 data in R and I want to get the aggregate total cases for each continent.
total_cases_continent <- covid_data %>%
  select(continent, new_cases) %>%
  group_by(continent) %>%
  summarize(total_cases = sum(new_cases))
Instead of the total cases for each continent, the result is a single row holding one overall total.
It looks like there might be some issues with the values of your variable "continent". I would recommend checking the class of the variable, as well as making sure all of the values are what you expected them to be. This is probably causing the issues within your group_by statement.
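A sketch of those checks, using a hypothetical stand-in for covid_data. One further common cause of a single-row result, not confirmed by the question, is that plyr was loaded after dplyr, so that summarize() resolves to plyr::summarize() and ignores the grouping. Calling dplyr::summarise() explicitly rules that out. The na.rm = TRUE argument is also an assumption, added so that missing counts do not turn a total into NA.

```r
library(dplyr)

# Hypothetical stand-in for covid_data
covid_toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", "Africa"),
  new_cases = c(10, NA, 5, 7, 3),
  stringsAsFactors = FALSE
)

# Inspect the grouping variable before aggregating
class(covid_toy$continent)
unique(covid_toy$continent)

total_cases_continent <- covid_toy %>%
  group_by(continent) %>%
  # explicit namespace avoids accidentally calling plyr::summarize()
  dplyr::summarise(total_cases = sum(new_cases, na.rm = TRUE))
```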

Apply a test by groups

I have a data sample of five-minute asset price returns (FiveMinRet) and selected events for a period covering several years. These events are hypothesized to have an effect on FiveMinRet (that is, to cause non-zero abnormal returns). From the time series data sample, I construct a sub-sample (sub_sample) that contains, for every event, only a window of, say, 100 minutes around that event.
As a part of a preliminary data analysis, I wish to formally test for the presence of heteroskedasticity and first-order autocorrelation within each window. Each window occurs on different dates, so a variable (Date) will be my grouping variable.
So my question in this regard is: Is there a way to apply a Ljung-Box test (Box.test(x, lag = 1, type = c("Box-Pierce", "Ljung-Box"), fitdf = 0) command in R) by groups (Date variable) and to present the test statistics/test results in a list or data frame?
I tried the following approach
Testresults = df %>% group_by(Date) %>% do(tidy(Box.test(df$FiveMinRet_sq, lag = 1, type = c("Ljung-Box"), fitdf = 0)))
The output is what I am looking for, however, by this approach, I obtain the same test statistics for all dates, so my approach is incorrect.
Without a reproducible example in the question I can't test the code for this solution, but one way to separate the tests by Date is to use split() and purrr::map().
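library(dplyr)
library(purrr)
library(broom)
testResults <- df %>%
  split(.$Date) %>%
  # run the test on each date's subset; tidy() returns a one-row data frame
  map(function(x) {
    tidy(Box.test(x$FiveMinRet_sq, lag = 1, type = "Ljung-Box", fitdf = 0))
  })
# combine into a data frame, keeping the date as a column
bind_rows(testResults, .id = "Date")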
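For a version that can be run without the original data, here is a base R sketch of the same split-and-test idea. The toy data frame, the helper run_box_test(), and the simulated values are all hypothetical; only Date and FiveMinRet_sq follow the names in the question.

```r
set.seed(1)

# Hypothetical data: two dates, 50 squared returns each
toy <- data.frame(
  Date = rep(c("2020-01-01", "2020-01-02"), each = 50),
  FiveMinRet_sq = c(rnorm(50)^2, cumsum(rnorm(50))^2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one Ljung-Box test, returned as a one-row data frame
run_box_test <- function(x) {
  res <- Box.test(x, lag = 1, type = "Ljung-Box", fitdf = 0)
  data.frame(statistic = unname(res$statistic), p_value = res$p.value)
}

# Each test sees only its own date's values, never the whole column
per_date <- lapply(split(toy$FiveMinRet_sq, toy$Date), run_box_test)
test_results <- do.call(rbind, per_date)
test_results$Date <- rownames(test_results)
```

The key difference from the attempt in the question is that the function receives the per-date subset rather than df$FiveMinRet_sq, which always refers to the full column.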

Using summarize (or equivalent?) to create a column of functions in an R dataframe

I'm working with some electricity data that has, for each hour, day and asset a step function which specifies the asset's offering of power at escalating prices. What I'd like to do is collapse those data into a data frame, tibble, etc. with date, time, asset and a row-specific step function. I'll then use that step function to populate some other columns later on.
Here's a quick reproducible example of what I want to do.
library(dplyr)
df_test<-data.frame(rep(1:25, times=1, each=4))
names(df_test)[1]<-"asset"
df_test$block<-rep(1:4, times=25)
df_test$from<-rep(seq(0,150,50), times=25)
df_test$to<-df_test$from+50
df_test$index<-runif(100)*100
df_test<-df_test %>% group_by(asset) %>% mutate(price=cumsum(index))
This is basically an example of what I would have for each hour of each day, except that in my case, the numbers of blocks are different (some firms bid a single block, others bid up to 7 blocks, but that's likely not material to the problem here).
Now, what I would like to do is, for each asset, calculate a step function using the from, to, and price blocks and store it in a data frame by asset (again, in my extended case, it will be by date, hour, and asset).
For example, using the first group I could do this
generate_func<-function(x,y){
stepfun(x, y, f = as.numeric(0), ties = "ordered",right = FALSE)
}
eg_func<-generate_func(df_test$from[2:4],df_test$price[1:4])
The function eg_func lets me find the implied price at any value x for asset 1.
eg_func(500)
[1] 43.10305
What I'd like to do is group my data by asset and then store a version of eg_func for each asset in a second column of a data frame or equivalent.
Basically, what I want to do is something like:
df_sum<-df_test %>% group_by(asset) %>% summarize(
step_func=generate_func(from[-1],price)
)
But I get:
Error: Column `step_func` is of unsupported type function
Update:
@akrun has gotten me a step down the road. So, if I wrap the function in a list, I can do what I want to do...at least the first step:
df_func<-df_test %>%
group_by(asset) %>%
summarize(step_func=list(generate_func(from[-1],price)))
So now I have a data frame with a step function for each asset. Now, my next quest is to be able to evaluate that function to create a new column evaluating the step function at a particular value. So, for example, I can evaluate the first asset's bid at a value of 50:
df_func[1,2][[1]][[1]](50)
[1] 49.60776
I'd like to be able to do this in a mutate command, so something akin to:
df_func <-df_func %>% mutate(bid_50=step_func[[2]](50))
But that applies the second step function to everyone. How do I fill column bid_50 with each asset's step function evaluated at 50?
Update #2: @akrun again with the solution:
df_func <-df_func %>% mutate(bid_50=map_dbl(step_func, ~ .x(50)))
Because eg_func is a function, it is better to wrap it in a list so it can be stored in a column. Then loop over the list elements with map_dbl, applying each function to the argument passed, to create the new column 'bid_50'.
library(tidyverse)
df_test %>%
group_by(asset) %>%
summarize(step_func=list(generate_func(from[-1],price))) %>%
mutate(bid_50 = map_dbl(step_func, ~ .x(50)))
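To make the behaviour easy to check by hand, here is the same pattern on a small table with fixed prices. The bids data frame and the evaluation points 25 and 75 are hypothetical; generate_func is copied from the question.

```r
library(dplyr)
library(purrr)

# Hypothetical bid table: two assets, three blocks each, fixed prices
bids <- data.frame(
  asset = rep(1:2, each = 3),
  from  = rep(c(0, 50, 100), times = 2),
  price = c(10, 20, 30, 5, 15, 25)
)

generate_func <- function(x, y) {
  stepfun(x, y, f = as.numeric(0), ties = "ordered", right = FALSE)
}

bid_funcs <- bids %>%
  group_by(asset) %>%
  # list() lets each group's function live in a list-column
  summarise(step_func = list(generate_func(from[-1], price))) %>%
  # map_dbl() calls each row's own function, not a single shared one
  mutate(bid_25 = map_dbl(step_func, ~ .x(25)),
         bid_75 = map_dbl(step_func, ~ .x(75)))
```

With breakpoints at 50 and 100, a quantity of 25 falls in the first block and 75 in the second, so each asset returns its own first and second block price.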

R aggregating irregular time series data by groups (with meta data)

Hi, I have a data frame (~4 million rows) with time series data for different sites and events.
Here is a rough idea of my data, obviously on a different scale. I have several similar time series, so I've kept it general because I want to be able to apply the solution in different cases.
Data1 <- data.frame(DateTimes =as.POSIXct("1988-04-30 13:20:00")+c(1:10,12:15,20:30,5:13,16:20,22:35)*300,
Site = c(rep("SiteA",25),rep("SiteB",28)),
Quality = rep(25,53),
Value = round(runif(53,0,5),2),
Othermetadata = c(rep("E1",10),rep("E2",15),rep("E1",10),rep("E2",18)))
What I'm looking for is a simple way to group and aggregate this data to different timesteps while keeping metadata which doesn't vary within the group
I have tried using the zoo library and zoo::aggregate ie:
library(zoo)
zooData <- read.zoo(select(Data1, DateTimes, Value))
zooagg <- aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)
However when I do this I'm losing all my metadata and merging different sites data.
I wondered about trying to use plyr or dplyr to split up the data and then applying the aggregate, but I'm still going to lose my other columns.
Is there a better way to do this? I had a brief look at the documentation for the xts library but couldn't see an intuitive solution in there either.
*Note: as I want this to work for a few different things, both the starting time step and the final time step might change, with the possibility of a random time step, or a somewhat regular time step with missing points. The FUN applied may vary (mostly sum or mean), as may the fields I want to split it by.*
Edit: I found the solution after Hercules Apergis pushed me in the right direction.
newData <- Data1 %>% group_by(timeagg, Site) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1,newData) %>% select(-DateTimes, - Value) %>% distinct()
The original DateTimes column wasn't a grouping variable - it was the time series, so I added a grouping variable of my aggregated time (here: time to the nearest hour) and summarised on this. Problem was if I joined on this new column I missed any points where there was time during that hour but not on the hour. Thus the inner_join %>% select %>% distinct method.
Now hopefully it works with my real data not eg data!
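The timeagg column is not defined in the snippet above. A minimal sketch of one way it could be built, assuming it is the timestamp floored to a chosen step (the same arithmetic as the zoo call), and assuming the metadata is constant within each group so that it can simply be added to group_by(). The toy data and the step_secs variable are hypothetical.

```r
library(dplyr)

# Hypothetical series on a 5-minute grid, two sites
start <- as.POSIXct("1988-04-30 13:00:00", tz = "UTC")
toy <- data.frame(
  DateTimes     = start + c(0, 300, 600, 3600, 3900, 0, 300, 3600),
  Site          = c(rep("SiteA", 5), rep("SiteB", 3)),
  Othermetadata = c(rep("E1", 5), rep("E2", 3)),
  Value         = c(1, 2, 3, 4, 5, 10, 20, 30),
  stringsAsFactors = FALSE
)

step_secs <- 3600  # assumed target time step, in seconds

hourly <- toy %>%
  # floor each timestamp to the start of its time step
  mutate(timeagg = as.POSIXct(
    as.numeric(DateTimes) - as.numeric(DateTimes) %% step_secs,
    origin = "1970-01-01", tz = "UTC"
  )) %>%
  # grouping by the constant metadata carries it through the summary
  group_by(timeagg, Site, Othermetadata) %>%
  summarise(Total = sum(Value), .groups = "drop")
```

Grouping by the metadata columns avoids the join back to the original data, at the cost of splitting a group if the metadata does vary within a time step.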
Given the function that you have on aggregation:
aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)
You want to sum the values by group of times AND NOT lose other columns. You can simply do this with the dplyr package:
library(dplyr)
newdata <- Data1 %>% group_by(DateTimes) %>% summarise(sum(Value))
finaldata <- inner_join(Data1, newdata, by = "DateTimes")
Here newdata is a data.frame in which the Values are summed for each group of DateTimes. Then inner_join merges the rows the two datasets have in common on the DateTimes variable. Since I am not entirely sure what your desired output is, this should be a good starting point.

Resources