I have a large panel dataset of roughly 4 million daily observations (Overview of my Dataset).
The variable symbol identifies the 952 different stocks contained in the dataset, and the other variables are stock-related daily measures. I want to calculate the weekly averages of the variables rv, rskew, rkurt and rsj for each of the 952 stocks included in symbol.
I tried to group the dataset with group_by(symbol), but I did not manage to aggregate the daily observations to weekly values in the right way.
I am not very experienced with R and would highly appreciate some help here.
This is simple with the lubridate and dplyr packages:
library(dplyr)
library(lubridate)

set.seed(123)
df <- data.frame(date = seq.Date(ymd('2020-07-01'), ymd('2020-07-31'), by = 'day'),
                 symbol = 'a',
                 x = runif(31),
                 y = runif(31),
                 z = runif(31))

df <- df %>%
  mutate(year = year(date),
         week = week(date)) %>%
  group_by(year, week, symbol) %>%
  summarise(x = mean(x),
            y = mean(y),
            z = mean(z))
> df
# A tibble: 5 x 6
# Groups: year, week [5]
year week symbol x y z
<dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 2020 27 a 0.555 0.552 0.620
2 2020 28 a 0.652 0.292 0.461
3 2020 29 a 0.495 0.350 0.398
4 2020 30 a 0.690 0.495 0.609
5 2020 31 a 0.466 0.378 0.376
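Applied to your panel, the same pattern with across() would look like this (a sketch, assuming your data frame is called panel and has a date column alongside symbol, rv, rskew, rkurt and rsj):

library(dplyr)
library(lubridate)

weekly <- panel %>%
  mutate(year = year(date),
         week = week(date)) %>%
  group_by(symbol, year, week) %>%
  summarise(across(c(rv, rskew, rkurt, rsj), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")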
In the variable type there are actual and budget values. How can I add a new variable and calculate its value? My current code works, but it is a bit cumbersome. Can anyone help? Thanks!
ori_data <- data.frame(
  category = c("A", "A", "A", "B", "B", "B"),
  year = c(2021, 2022, 2022, 2021, 2022, 2022),
  type = c("actual", "actual", "budget", "actual", "actual", "budget"),
  sales = c(100, 120, 130, 70, 80, 90),
  profit = c(3.7, 5.52, 5.33, 2.73, 3.92, 3.69)
)
Add sales inc%:
ori_data$sales_inc_or_budget_acheved[ori_data$category == 'A' & ori_data$year == 2022 & ori_data$type == 'actual'] <-
  ori_data$sales[ori_data$category == 'A' & ori_data$year == 2022 & ori_data$type == 'actual'] /
  ori_data$sales[ori_data$category == 'A' & ori_data$year == 2021 & ori_data$type == 'actual'] - 1
Add budget achieved%:
ori_data$sales_inc_or_budget_acheved[ori_data$category == 'A' & ori_data$year == 2022 & ori_data$type == 'budget'] <-
  ori_data$sales[ori_data$category == 'A' & ori_data$year == 2022 & ori_data$type == 'actual'] /
  ori_data$sales[ori_data$category == 'A' & ori_data$year == 2022 & ori_data$type == 'budget']
Using group_by() and if_else() you could do:
library(dplyr)

ori_data |>
  group_by(category) |>
  arrange(category, type, year) |>
  mutate(sales_inc_or_budget_achieved = if_else(type == "actual",
                                                sales / lag(sales) - 1,
                                                lag(sales) / sales)) |>
  ungroup()
#> # A tibble: 6 × 6
#> category year type sales profit sales_inc_or_budget_achieved
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 A 2021 actual 100 3.7 NA
#> 2 A 2022 actual 120 5.52 0.2
#> 3 A 2022 budget 130 5.33 0.923
#> 4 B 2021 actual 70 2.73 NA
#> 5 B 2022 actual 80 3.92 0.143
#> 6 B 2022 budget 90 3.69 0.889
And using across you could do the same for both sales and profit:
ori_data |>
  group_by(category) |>
  arrange(category, type, year) |>
  mutate(across(c(sales, profit), ~ if_else(type == "actual",
                                            .x / lag(.x) - 1,
                                            lag(.x) / .x),
                .names = "{.col}_inc_or_budget_achieved")) |>
  ungroup()
#> # A tibble: 6 × 7
#> category year type sales profit sales_inc_or_budget_achie… profit_inc_or_b…
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2021 actual 100 3.7 NA NA
#> 2 A 2022 actual 120 5.52 0.2 0.492
#> 3 A 2022 budget 130 5.33 0.923 1.04
#> 4 B 2021 actual 70 2.73 NA NA
#> 5 B 2022 actual 80 3.92 0.143 0.436
#> 6 B 2022 budget 90 3.69 0.889 1.06
stefan's answer works perfectly well; however, I would suggest rearranging your data first.
In my opinion, sales and profit are types of measures (i.e. observations), and actual and budget are the measurements here:
library(tidyr)
library(dplyr)

ori_data2 <- ori_data %>%
  pivot_longer(c(sales, profit)) %>%
  pivot_wider(names_from = type, values_from = value) %>%
  group_by(category, name) %>%
  arrange(year, .by_group = TRUE)
Then your calculations become much easier:
ori_data2 %>%
  mutate(increase = actual / lag(actual) - 1,    # compare to the year before
         budget_achieved = actual / budget) %>%  # compare actual vs. budget
  filter(year == 2022) %>%                       # you can filter for the year of interest
  mutate(across(c(increase, budget_achieved), scales::percent))  # and format as percent
This is my first Stack Overflow question, so bear with me please. I'm trying to create dataframes that are ordered alphabetically based on a "Variable" field, with exceptions made for rows with particular values (e.g. "Avg Temp." at the top of the dataframe and "Intercept" at the bottom). The starting dataframe might look like this, for example:
Variable Model 1 Estimate
Year=2009 0.026
Year=2010 -0.04
Year=2011 -0.135***
Age 0.033***
Avg Temp. -0.001***
Intercept -3.772***
Sex -0.073***
Year=2008 0.084***
Year=2012 -0.237***
Year=2013 -0.326***
Year=2014 -0.431***
Year=2015 -0.589***
And I want to reorder it as such:
Variable Model 1 Estimate
Avg Temp. -0.001***
Age 0.033***
Sex -0.073***
Year=2008 0.084***
Year=2009 0.026
Year=2010 -0.04
Year=2011 -0.135***
Year=2012 -0.237***
Year=2013 -0.326***
Year=2014 -0.431***
Year=2015 -0.589***
Intercept -3.772***
Appreciate any help on this.
You can use the fct_relevel() function from {forcats}. The first call puts Avg Temp., Age and Sex at the beginning (after = 0 by default). The second call puts Intercept at the end (n() refers to the number of rows in the data frame).
library(tidyverse)

df <- tribble(~Variable,   ~Model,
              "Year=2009",  0.026,
              "Year=2010", -0.04,
              "Year=2011", -0.135,
              "Age",        0.033,
              "Avg Temp.", -0.001,
              "Intercept", -3.772,
              "Sex",       -0.073,
              "Year=2008",  0.084,
              "Year=2012", -0.237,
              "Year=2013", -0.326,
              "Year=2014", -0.431,
              "Year=2015", -0.589)

df %>%
  mutate(Variable = as.factor(Variable),
         Variable = fct_relevel(Variable, "Avg Temp.", "Age", "Sex"),
         Variable = fct_relevel(Variable, "Intercept", after = n())) %>%
  arrange(Variable)
# A tibble: 12 × 2
Variable Model
<fct> <dbl>
1 Avg Temp. -0.001
2 Age 0.033
3 Sex -0.073
4 Year=2008 0.084
5 Year=2009 0.026
6 Year=2010 -0.04
7 Year=2011 -0.135
8 Year=2012 -0.237
9 Year=2013 -0.326
10 Year=2014 -0.431
11 Year=2015 -0.589
12 Intercept -3.77
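If you would rather not keep Variable as a factor in the result, the same releveling can be done inline inside arrange(). This is just a sketch of the same forcats idea, where after = Inf moves Intercept to the end:

df %>%
  arrange(fct_relevel(fct_relevel(Variable, "Avg Temp.", "Age", "Sex"),
                      "Intercept", after = Inf))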
Another option, in case the dataframes contain a variety of different variable names besides year and intercept, is something like this:
library(tidyverse)

# Sample data
df <- tribble(
  ~variable,   ~model_1_estimate,
  "Year=2009", "0.026",
  "Year=2010", "-0.04",
  "Year=2011", "-0.135***",
  "Age",       "0.033***",
  "Avg Temp.", "-0.001***",
  "Intercept", "-3.772***",
  "Sex",       "-0.073***",
  "Year=2008", "0.084***",
  "Year=2012", "-0.237***",
  "Year=2013", "-0.326***",
  "Year=2014", "-0.431***",
  "Year=2015", "-0.589***"
)

# Possible solution
df |>
  separate(variable, c("term", "year"), sep = "=") |>
  mutate(intercept = if_else(term == "Intercept", 1, 0)) |>
  arrange(intercept, term, year) |>
  select(-intercept)
#> # A tibble: 12 × 3
#> term year model_1_estimate
#> <chr> <chr> <chr>
#> 1 Age <NA> 0.033***
#> 2 Avg Temp. <NA> -0.001***
#> 3 Sex <NA> -0.073***
#> 4 Year 2008 0.084***
#> 5 Year 2009 0.026
#> 6 Year 2010 -0.04
#> 7 Year 2011 -0.135***
#> 8 Year 2012 -0.237***
#> 9 Year 2013 -0.326***
#> 10 Year 2014 -0.431***
#> 11 Year 2015 -0.589***
#> 12 Intercept <NA> -3.772***
Created on 2022-06-28 by the reprex package (v2.0.1)
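If you want the original single variable column back after sorting, a small follow-up (just a sketch on top of the pipeline above) is to unite() the pieces again, dropping the NA years:

df |>
  separate(variable, c("term", "year"), sep = "=") |>
  mutate(intercept = if_else(term == "Intercept", 1, 0)) |>
  arrange(intercept, term, year) |>
  select(-intercept) |>
  unite("variable", term, year, sep = "=", na.rm = TRUE)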
I am trying to correlate several variables according to a specific group (COUNTY) in R. Although I am able to successfully find the correlation for each column through this method, I can't seem to find a way to save the p-value to the table for each group. Any suggestions?
Example Data:
crops <- data.frame(
  COUNTY = sample(37001:37900),
  CropYield = sample(c(1:100), 10, replace = TRUE),
  MaxTemp = sample(c(40:80), 10, replace = TRUE),
  precip = sample(c(0:10), 10, replace = TRUE),
  ColdDays = sample(c(1:73), 10, replace = TRUE))
Example Code:
crops %>%
group_by(COUNTY) %>%
do(data.frame(Cor=t(cor(.[,2:5], .[,2]))))
^This gives me the correlation for each column but I need to know the p-value for each one as well. Ideally the final output would look like this.
Desired Output
You only have one observation per COUNTY, so it will not work. I set up more examples per COUNTY:
set.seed(111)
crops <- data.frame(
  COUNTY = sample(37001:37002, 10, replace = TRUE),
  CropYield = sample(c(1:100), 10, replace = TRUE),
  MaxTemp = sample(c(40:80), 10, replace = TRUE),
  precip = sample(c(0:10), 10, replace = TRUE),
  ColdDays = sample(c(1:73), 10, replace = TRUE))
I think you need to convert to a long format and do a cor.test() per COUNTY and variable:
library(dplyr)
library(tidyr)

calcor <- function(da) {
  data.frame(cor.test(da$CropYield, da$value)[c("estimate", "p.value")])
}

crops %>%
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  do(calcor(.))
# A tibble: 6 x 4
# Groups: COUNTY, name [6]
COUNTY name estimate p.value
<int> <chr> <dbl> <dbl>
1 37001 ColdDays 0.466 0.292
2 37001 MaxTemp -0.225 0.628
3 37001 precip -0.356 0.433
4 37002 ColdDays 0.888 0.304
5 37002 MaxTemp 0.941 0.220
6 37002 precip -0.489 0.674
The above gives you correlation for every variable against crop yield, for every county. Now it's a matter of converting it into wide format:
crops %>%
pivot_longer(-c(COUNTY,CropYield)) %>%
group_by(COUNTY,name) %>% do(calcor(.)) %>%
pivot_wider(values_from=c(estimate,p.value),names_from=name)
COUNTY estimate_ColdDa… estimate_MaxTemp estimate_precip p.value_ColdDays
<int> <dbl> <dbl> <dbl> <dbl>
1 37001 0.466 -0.225 -0.356 0.292
2 37002 0.888 0.941 -0.489 0.304
# … with 2 more variables: p.value_MaxTemp <dbl>, p.value_precip <dbl>
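As an aside, do() is superseded in current dplyr; the same result can be obtained with group_modify() and broom::tidy() (a sketch, assuming the broom package is available):

library(dplyr)
library(tidyr)
library(broom)  # tidy() turns the cor.test() result into a one-row data frame

crops %>%
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  group_modify(~ tidy(cor.test(.x$CropYield, .x$value))) %>%
  ungroup() %>%
  select(COUNTY, name, estimate, p.value) %>%
  pivot_wider(names_from = name, values_from = c(estimate, p.value))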
Please see attached image of dataset.
What are the different ways to only retain a single value for each 'Month'? I've got a bunch of data points and would only need to retain, say, the mean value.
Many thanks
A different way of using the aggregate() function.
> aggregate(Temp ~ Month, data=airquality, FUN = mean)
Month Temp
1 5 65.54839
2 6 79.10000
3 7 83.90323
4 8 83.96774
5 9 76.90000
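If you need the monthly mean of more than one column, the formula interface also accepts cbind() on the left-hand side (a sketch; note that aggregate()'s formula method drops rows with an NA in any listed column by default):

aggregate(cbind(Temp, Wind) ~ Month, data = airquality, FUN = mean)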
library(tidyverse)
library(lubridate)
#example data from airquality:
aq<-as_data_frame(airquality)
aq$mydate<-lubridate::ymd(paste0(2018, "-", aq$Month, "-", aq$Day))
> aq
# A tibble: 153 x 7
Ozone Solar.R Wind Temp Month Day mydate
<int> <int> <dbl> <int> <int> <int> <date>
1 41 190 7.40 67 5 1 2018-05-01
2 36 118 8.00 72 5 2 2018-05-02
3 12 149 12.6 74 5 3 2018-05-03
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE))
summarize() can compute multiple summaries at once:
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE),
"Num" = n(),
"SD" = sd(Temp, na.rm=TRUE))
# A tibble: 5 x 4
Month Mean_Temp Num SD
<dbl> <dbl> <int> <dbl>
1 5.00 65.5 31 6.85
2 6.00 79.1 30 6.60
3 7.00 83.9 31 4.32
4 8.00 84.0 31 6.59
5 9.00 76.9 30 8.36
Lubridate Cheatsheet
A data.table answer:
# load libraries
library(data.table)
library(lubridate)
setDT(dt)
dt[, .(meanValue = mean(value, na.rm =TRUE)), by = .(monthDate = floor_date(dates, "month"))]
Where dt has at least columns value and dates.
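For example, dt could be built from airquality like this (a sketch; the year 2018 is made up, as in the dplyr answer above):

library(data.table)
library(lubridate)

dt <- as.data.table(airquality)
dt[, dates := make_date(2018, Month, Day)]  # assemble a Date column; 2018 is arbitrary
dt[, value := Temp]                         # treat Temp as the value to average
dt[, .(meanValue = mean(value, na.rm = TRUE)),
   by = .(monthDate = floor_date(dates, "month"))]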
We can group by the month of the dataset's index (via zoo::as.yearmon) and use that in aggregate(), which has a method for zoo/xts objects, to get the mean:
aggregate(dat, as.yearmon(index(dat)), FUN = mean)
NB: here we assumed that the dataset is in xts or zoo format. If the dataset has a Month column instead, then use base R's aggregate():
aggregate(dat, list(Month = dat$Month), FUN = mean)
I am not very experienced with loops so I am not sure where I went wrong here...
I have a dataframe that looks like:
month year day mean.temp mean.temp.year.month
1 1961 1 4.85 4.090323
1 1961 2 4.90 4.090323
1 1961 3 2.95 4.090323
1 1961 4 3.40 4.090323
1 1961 5 2.90 4.090323
dataset showing 3 months for 2 years can be found here:
https://drive.google.com/file/d/1w7NVeoEh8b7cAkU3cu1sXx6yCh75Inqg/view?usp=sharing
and I want to subset this dataframe by year and month so that I can run one nls model per year and month. Since my dataset contains 56 years (and each year has 12 months), that will give 672 models. Then I want to store the parameter estimates in a separate table.
I've created this code, but I can't work out why it is only giving me the parameter estimates for month 12 (all 56 years, but just month 12):
table <- matrix(99999, nrow = 672, ncol = 4)

YEARMONTHsel <- unique(df_weather[c("year", "month")])
YEARsel <- unique(df_weather$year)
MONTHsel <- unique(df_weather$month)

for (i in 1:length(YEARsel)) {
  for (j in 1:length(MONTHsel)) {
    temp2 <- df_weather[df_weather$year == YEARsel[i] & df_weather$month == MONTHsel[j], ]
    mn <- nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
              data = temp2, control = nlc,
              start = list(alpha = -6.07043, phi = -10))
    cr <- as.vector(coef(mn))
    nv <- length(coef(mn))
    table[i, 1:nv] <- cr
    table[i, nv + 1] <- YEARsel[i]
    table[i, nv + 2] <- MONTHsel[j]
  }
}
I've tried several options (i.e. without using a nested loop) but I'm not getting anywhere.
Any help would be greatly appreciated! Thanks.
Based on your loop, it looks like you want to run the regression grouped by year and month and then extract the coefficients into a new dataframe (correct me if that's wrong).
library(readxl)
library(tidyverse)

df <- read_excel("~/Downloads/df_weather.xlsx")

df %>%
  nest(-month, -year) %>%
  mutate(model = map(data, ~ nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
                                 data = .x, control = "nlc",
                                 start = list(alpha = -6.07043, phi = -10))),
         coeff = map(model, ~ coefficients(.x))) %>%
  unnest(coeff %>% map(broom::tidy)) %>%
  spread(names, x) %>%
  arrange(year)
#> # A tibble: 6 x 4
#> month year alpha phi
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1961 0.561 -10.8
#> 2 2 1961 -1.50 -10.5
#> 3 3 1961 -2.06 -9.77
#> 4 1 1962 -3.35 -5.48
#> 5 2 1962 -2.27 -9.97
#> 6 3 1962 0.959 -10.8
First we nest the data by your groups (in this case year and month), then we map the model over each group, then we map the coefficients for each group, and lastly we unnest the coefficients and spread the data from long to wide.
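As a side note, the reason your loop only kept month 12 is that it writes to table[i, ...], indexing rows by year alone, so within each year every month overwrites the previous one and only the last month survives. If you prefer to keep the loop, a minimal fix is a running row counter (a sketch reusing the setup from your question, including your nlc control object):

row <- 0
for (i in seq_along(YEARsel)) {
  for (j in seq_along(MONTHsel)) {
    row <- row + 1
    temp2 <- df_weather[df_weather$year == YEARsel[i] & df_weather$month == MONTHsel[j], ]
    mn <- nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
              data = temp2, control = nlc,  # nlc is the control object from the question
              start = list(alpha = -6.07043, phi = -10))
    table[row, 1:2] <- coef(mn)  # alpha, phi
    table[row, 3] <- YEARsel[i]
    table[row, 4] <- MONTHsel[j]
  }
}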