This is my first Stack Overflow question, so please bear with me. I'm trying to create dataframes that are ordered alphabetically based on a "Variable" field, with exceptions made for rows with particular values (e.g. "Avg Temp." at the top of the dataframe and "Intercept" at the bottom). The starting dataframe might look like this, for example:
Variable Model 1 Estimate
Year=2009 0.026
Year=2010 -0.04
Year=2011 -0.135***
Age 0.033***
Avg Temp. -0.001***
Intercept -3.772***
Sex -0.073***
Year=2008 0.084***
Year=2012 -0.237***
Year=2013 -0.326***
Year=2014 -0.431***
Year=2015 -0.589***
And I want to reorder it as such:
Variable Model 1 Estimate
Avg Temp. -0.001***
Age 0.033***
Sex -0.073***
Year=2008 0.084***
Year=2009 0.026
Year=2010 -0.04
Year=2011 -0.135***
Year=2012 -0.237***
Year=2013 -0.326***
Year=2014 -0.431***
Year=2015 -0.589***
Intercept -3.772***
Appreciate any help on this.
You can use the fct_relevel() function from {forcats}. The first call moves Avg Temp., Age and Sex to the front (after = 0 by default). The second call moves Intercept to the end (n() returns the number of rows in the data frame).
library(tidyverse)
df <-
tribble(~Variable, ~Model,
"Year=2009", 0.026,
"Year=2010", -0.04,
"Year=2011", -0.135,
"Age", 0.033,
"Avg Temp.", -0.001,
"Intercept", -3.772,
"Sex", -0.073,
"Year=2008", 0.084,
"Year=2012", -0.237,
"Year=2013", -0.326,
"Year=2014", -0.431,
"Year=2015", -0.589)
df %>%
mutate(Variable = as.factor(Variable),
Variable = fct_relevel(Variable, "Avg Temp.", "Age", "Sex"),
Variable = fct_relevel(Variable, "Intercept", after = n())) %>%
arrange(Variable)
# A tibble: 12 × 2
Variable Model
<fct> <dbl>
1 Avg Temp. -0.001
2 Age 0.033
3 Sex -0.073
4 Year=2008 0.084
5 Year=2009 0.026
6 Year=2010 -0.04
7 Year=2011 -0.135
8 Year=2012 -0.237
9 Year=2013 -0.326
10 Year=2014 -0.431
11 Year=2015 -0.589
12 Intercept -3.77
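If pulling in {forcats} is not an option, the same idea (a fixed set of names first, everything else alphabetically, Intercept last) can be sketched in base R. The vectors `first` and `last` and the object `sorted` are names chosen here for illustration, and the data is a shortened copy of the example above.

```r
# Shortened copy of the example data
df <- data.frame(
  Variable = c("Year=2009", "Age", "Avg Temp.", "Intercept", "Sex", "Year=2008"),
  Model    = c(0.026, 0.033, -0.001, -3.772, -0.073, 0.084),
  stringsAsFactors = FALSE
)

first <- c("Avg Temp.", "Age", "Sex")  # rows pinned to the top, in this order
last  <- "Intercept"                   # row pinned to the bottom

# Everything that is not pinned is sorted alphabetically
middle <- sort(setdiff(df$Variable, c(first, last)))
lvls   <- c(first, middle, last)

# match() gives each row its position in the desired order
sorted <- df[order(match(df$Variable, lvls)), ]
sorted$Variable
```

Because the order lives in a plain character vector, the same `lvls` can be reused for several model tables.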
Another option, in case the dataframes contain a variety of different variable names besides year and intercept, is something like this:
library(tidyverse)
# Sample data
df <- tribble(
~variable, ~model_1_estimate,
"Year=2009", "0.026",
"Year=2010", "-0.04",
"Year=2011", "-0.135***",
"Age", "0.033***",
"Avg Temp.", "-0.001***",
"Intercept", "-3.772***",
"Sex", "-0.073***",
"Year=2008", "0.084***",
"Year=2012", "-0.237***",
"Year=2013", "-0.326***",
"Year=2014", "-0.431***",
"Year=2015", "-0.589***"
)
# Possible solution
df |>
separate(variable, c("term", "year"), sep = "=", fill = "right") |>
mutate(intercept = if_else(term == "Intercept", 1, 0)) |>
arrange(intercept, term, year) |>
select(-intercept)
#> # A tibble: 12 × 3
#> term year model_1_estimate
#> <chr> <chr> <chr>
#> 1 Age <NA> 0.033***
#> 2 Avg Temp. <NA> -0.001***
#> 3 Sex <NA> -0.073***
#> 4 Year 2008 0.084***
#> 5 Year 2009 0.026
#> 6 Year 2010 -0.04
#> 7 Year 2011 -0.135***
#> 8 Year 2012 -0.237***
#> 9 Year 2013 -0.326***
#> 10 Year 2014 -0.431***
#> 11 Year 2015 -0.589***
#> 12 Intercept <NA> -3.772***
Created on 2022-06-28 by the reprex package (v2.0.1)
I have a data frame and a vector as follows:
my_df <- as.data.frame(
list(year = c(2001, 2001, 2001, 2001, 2001, 2001), month = c(1,
2, 3, 4, 5, 6), Pdt_d0 = c(0.379045935402736, 0.377328817455841,
0.341158889847019, 0.36761990427443, 0.372442657083218, 0.382702189949558
), Pdt_d1 = c(0.146034519173855, 0.166289573095497, 0.197787188740911,
0.137071647982617, 0.162103042313547, 0.168566518193772), Pdt_d2 = c(0.126975939811326,
0.107708783271871, 0.14096203677089, 0.142228236885706, 0.115542396064519,
0.106935751726809), Pdt_tot = c(2846715, 2897849.5, 2935406.25,
2850649, 2840313.75, 3087993.5))
)
my_vec <- 1:3
I want to multiply Pdt_d0:Pdt_d2 with the corresponding element from my_vec, while keeping the other columns untouched. I can get the desired multiplication with dplyr::select(my_df, num_range("Pdt_d", 0:2)) %>% mapply(`*`, ., my_vec) but I lose the year, month, Pdt_tot columns in the process. I tried to achieve my goal with dplyr::select(my_df, num_range("Pdt_d", 0:2)) <- dplyr::select(my_df, num_range("Pdt_d", 0:2)) %>% mapply(`*`, ., my_vec) which returns an error: 'select<-' is not an exported object. Is there an obvious trick I am not seeing?
I don't think my question is a duplicate; I have seen the answers in here and here but neither question allows me to choose variables by name
You can keep the Map/mapply logic you tried and assign the result back to the same columns, just outside of the tidy world:
vars <- paste0("Pdt_d", 0:2)
my_df[vars] <- Map(`*`, my_df[vars], my_vec)
my_df
# year month Pdt_d0 Pdt_d1 Pdt_d2 Pdt_tot
#1 2001 1 0.3790459 0.2920690 0.3809278 2846715
#2 2001 2 0.3773288 0.3325791 0.3231263 2897850
#3 2001 3 0.3411589 0.3955744 0.4228861 2935406
#4 2001 4 0.3676199 0.2741433 0.4266847 2850649
#5 2001 5 0.3724427 0.3242061 0.3466272 2840314
#6 2001 6 0.3827022 0.3371330 0.3208073 3087994
This works because [<- exists as a function in R: it assigns to the part of the left-hand side selected by the square brackets, as in my_df[vars].
The error you got occurs because the code has a select() call on the left-hand side, and there is no 'select<-' function. In other words, you can't assign to a select()-ion because select() isn't set up to work like that. The tidy functions are usually meant to be piped, like my_df %>% select() %>% etc, without overwriting the original input.
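To make the point about replacement functions concrete, here is a small sketch on toy data (`small_df` and `weights` are made up for illustration). It shows that the bracket assignment is just a call to the `[<-` function, which is exactly the kind of function that does not exist for select().

```r
# Toy data, invented for this illustration
small_df <- data.frame(id = 1:3, a = c(1, 2, 3), b = c(10, 20, 30))
weights  <- c(2, 3)
vars     <- c("a", "b")

# Usual replacement syntax: overwrite the selected columns in place
res1 <- small_df
res1[vars] <- Map(`*`, res1[vars], weights)

# The same operation written as an explicit call to the `[<-` function
res2 <- `[<-`(small_df, vars, value = Map(`*`, small_df[vars], weights))

identical(res1, res2)
```

R rewrites `x[i] <- value` into a call to `[<-`; for `select(x, ...) <- value` it would look for a function named `select<-`, which dplyr does not provide.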
I don't think that you want to do this mess, but it does work.
library(dplyr)
library(tidyr)
my_df %>%
gather(variable, value, -year,-month,-Pdt_tot) %>%
group_by(year, month, Pdt_tot) %>%
mutate(value = value * my_vec) %>%
spread(variable,value)
year month Pdt_tot Pdt_d0 Pdt_d1 Pdt_d2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 1 2846715 0.379 0.292 0.381
2 2001 2 2897850. 0.377 0.333 0.323
3 2001 3 2935406. 0.341 0.396 0.423
4 2001 4 2850649 0.368 0.274 0.427
5 2001 5 2840314. 0.372 0.324 0.347
6 2001 6 3087994. 0.383 0.337 0.321
The same approach without naming year, month, and Pdt_tot explicitly:
my_df %>%
gather(variable, value, - !num_range("Pdt_d", 0:2)) %>%
group_by(across(c(-variable, -value))) %>%
mutate(value = value * my_vec) %>%
spread(variable, value)
year month Pdt_tot Pdt_d0 Pdt_d1 Pdt_d2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 1 2846715 0.379 0.292 0.381
2 2001 2 2897850. 0.377 0.333 0.323
3 2001 3 2935406. 0.341 0.396 0.423
4 2001 4 2850649 0.368 0.274 0.427
5 2001 5 2840314. 0.372 0.324 0.347
6 2001 6 3087994. 0.383 0.337 0.321
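A shorter tidyverse route that avoids reshaping is sketched below, assuming dplyr 1.0 or later (for across() and cur_column()). The toy data and the named vector `multipliers` are invented for illustration; naming the multipliers by column is a design choice made here, not something from the question.

```r
library(dplyr)

# Toy data with the same column layout as the question
toy_df <- data.frame(
  year    = 2001,
  month   = 1:3,
  Pdt_d0  = c(0.1, 0.2, 0.3),
  Pdt_d1  = c(0.4, 0.5, 0.6),
  Pdt_d2  = c(0.7, 0.8, 0.9),
  Pdt_tot = c(100, 200, 300)
)

# Hypothetical: one multiplier per column, looked up by column name
multipliers <- c(Pdt_d0 = 1, Pdt_d1 = 2, Pdt_d2 = 3)

scaled <- toy_df %>%
  mutate(across(num_range("Pdt_d", 0:2),
                ~ .x * multipliers[[cur_column()]]))
scaled
```

Looking the multiplier up by name means the result does not depend on the order in which the columns happen to appear.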
I'm working with a large dataframe (7191 obs. of 19 variables). The columns are Month, Day, Year, and Site1 through Site16, where Month is one of June, July, August, September, or October.
Here is the beginning of my data, which I believe has only numerals in the columns site1-site16. Currently, I'm double checking to make sure.
dput(head(No_PS_for_Calculations ))
structure(list(Month = c("June", "June", "June", "June", "June",
"June"), Day = c(1, 2, 3, 4, 5, 6), Year = c(1970, 1970, 1970,
1970, 1970, 1970), Site1 = c("11.531", "12.298", "12.732", "12.619",
"12.5", "13.201"), Site2 = c("11.185", "11.439", "12.17", "12.432",
"12.337", "12.492"), Site3 = c("11.147", "11.496", "11.645",
"12.208", "12.644", "12.971"), Site4 = c("11.393", "11.707",
"11.961", "12.135", "12.809", "13.041"), Site5 = c("11.797",
"11.925", "12.34", "12.525", "13.01", "13.548"), Site6 = c("11.853",
"11.974", "12.16", "12.481", "12.459", "12.838"), Site7 = c("12.319",
"12.46", "12.476", "12.729", "13.026", "13.032"), Site8 = c("12.557",
"12.643", "12.789", "12.975", "13.202", "13.339"), Site9 = c("12.774",
"13.337", "13.896", "13.897", "13.819", "14.054"), Site10 = c("12.819",
"13.202", "13.783", "14.298", "14.284", "14.309"), Site11 = c("13.151",
"13.556", "13.833", "14.08", "14.244", "14.841"), Site12 = c("13.61",
"13.57", "14.111", "14.073", "14.331", "14.849"), Site13 = c("13.802",
"13.872", "14.244", "14.249", "14.255", "14.818"), Site14 = c("14.138",
"14.275", "14.332", "14.522", "14.244", "14.927"), Site15 = c("14.138",
"14.616", "14.766", "14.697", "14.61", "14.694"), Site16 = c("14.208",
"14.627", "14.928", "14.829", "14.69", "14.762")), row.names = 151:156, class = "data.frame")
For my analysis I am interested in finding the 90th percentile for each month in each year. For example for 1970, I need the 90th percentile for June, July, August, September, and October. I've tried a few different ways but keep getting stuck in the same spot so I thought I'd ask for help.
result <- No_PS_for_Calculations %>%
group_by(Year, Month) %>%
summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
data.frame(result)
Which results in the following error:
Error: Problem with `summarise()` input `..1`.
i `..1 = across(Site1:Site16, quantile, probs = 0.9, .names = "percent90_{col}")`.
x non-numeric argument to binary operator
i The error occurred in group 1: Year = 1970, Month = "August".
I've been able to find the percentile grouped by month but now need to include year for further analysis.
What is the best way to get the 90th Percentiles presented by year and then month?
Thanks for the help!
It seems likely that you have something non-numeric in a column between Site1 and Site16. Some fake data:
set.seed(42)
No_PS_for_Calculations <- data.frame(Year = rep(2020:2021, each = 3), Month = rep(c("Aug","Sep","Oct"), times = 2), Site1 = runif(6), Quux = sprintf("%0.03f", runif(6)), Site16 = runif(6))
No_PS_for_Calculations
# Year Month Site1 Quux Site16
# 1 2020 Aug 0.9148060 0.737 0.9346722
# 2 2020 Sep 0.9370754 0.135 0.2554288
# 3 2020 Oct 0.2861395 0.657 0.4622928
# 4 2021 Aug 0.8304476 0.705 0.9400145
# 5 2021 Sep 0.6417455 0.458 0.9782264
# 6 2021 Oct 0.5190959 0.719 0.1174874
No_PS_for_Calculations %>%
group_by(Year, Month) %>%
summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
# Error: Problem with `summarise()` input `..1`.
# x non-numeric argument to binary operator
# i Input `..1` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
# i The error occurred in group 1: Year = 2020, Month = "Aug".
If the non-numeric data ("Quux" column here) is not meant to be summarized, then you can select the columns you need to avoid any confusion:
No_PS_for_Calculations %>%
select(Year, Month, starts_with("Site")) %>%
group_by(Year, Month) %>%
summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
# # A tibble: 6 x 4
# # Groups: Year [2]
# Year Month percent90_Site1 percent90_Site16
# <int> <chr> <dbl> <dbl>
# 1 2020 Aug 0.915 0.935
# 2 2020 Oct 0.286 0.462
# 3 2020 Sep 0.937 0.255
# 4 2021 Aug 0.830 0.940
# 5 2021 Oct 0.519 0.117
# 6 2021 Sep 0.642 0.978
Another cause might be if a legitimate Site column is non-numeric, in which case you need to determine if you can easily convert to numeric. For instance, if "Quux" here is instead named "Site2"
names(No_PS_for_Calculations)[4] <- "Site2"
then we can try to convert it inline:
No_PS_for_Calculations %>%
mutate(Site2 = as.numeric(Site2)) %>%
group_by(Year, Month) %>%
summarise(across(Site1:`Site16`, quantile, probs = .9, .names = 'percent90_{col}'))
# # A tibble: 6 x 5
# # Groups: Year [2]
# Year Month percent90_Site1 percent90_Site2 percent90_Site16
# <int> <chr> <dbl> <dbl> <dbl>
# 1 2020 Aug 0.915 0.737 0.935
# 2 2020 Oct 0.286 0.657 0.462
# 3 2020 Sep 0.937 0.135 0.255
# 4 2021 Aug 0.830 0.705 0.940
# 5 2021 Oct 0.519 0.719 0.117
# 6 2021 Sep 0.642 0.458 0.978
Of course, any values that cannot be parsed as numbers will become NA, which you can then handle by filtering out or cleaning the offending rows.
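To find out which entries are responsible for those NAs before converting, something like the following base R sketch can help. The vector `site` stands in for one character column such as Site1; its values, including the stray "n/a", are invented for illustration.

```r
# Stand-in for one character column; "n/a" is an invented bad entry
site <- c("11.531", "12.298", "n/a", "12.619", NA)

# Convert, silencing the "NAs introduced by coercion" warning
converted <- suppressWarnings(as.numeric(site))

# Entries that were present but could not be parsed as numbers
bad <- is.na(converted) & !is.na(site)
site[bad]

# Once the bad entries are understood, summarise while skipping NAs
p90 <- quantile(converted, probs = 0.9, na.rm = TRUE)
```

Checking `site[bad]` first makes it clear whether the NAs come from placeholders like "n/a" or from genuinely corrupted values.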
I am running into some problems again and hope that someone can help me. I am doing research on the effect of ELI on ROS for firms and if the pandemic has an effect on this. For this research, my supervisor for my thesis has asked me to do a regression analysis per year grouped by industries (NAICS) and I am at a loss as to how to do this. I have firms in 46 different industries (NAICS) and 11 years of firm data per firm (2010-2020). Now I would like to run a regression ROS ~ ELI + ELI*Pandemic, for all industries for each year and then capture the resulting N (number of firms per industry) and R-squared in one file. The image below is an example of what I am trying to achieve:
I hope that someone can help me because I am at an absolute loss and I can't seem to find a similar question/answer on SO.
Here is the dput(head()) as an example. NAICS is the industry.
df <- structure(list(NAICS = c(315, 315, 315, 315, 315, 315),
Year = c(2010, 2011, 2012, 2013, 2014, 2015),
Firm = c("A", "A", "A", "A", "A", "A"),
ROS = c(0.17, 0.19, 0.29, 0.3, 0.29, 0.25),
ELI = c(0.856264428748774, 0.723379402777553, 0.958341156943977, 0.680567730897854, 0.790480861209701, 0.827279134948296),
Pandemic = c(0, 0, 0, 0, 0, 0)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame"))
Update02
I have made the necessary modifications to my solution after I received the original data set, and I don't think there will be any other problems.
library(dplyr)
library(tidyr)
library(broom)
library(purrr)
df %>%
group_by(NAICS, Year) %>%
add_count(name = "N") %>%
nest(data = !c(NAICS, Year, N)) %>%
mutate(models = map(data, ~ lm(ROS ~ ELI + ELI * Pandemic, data = .)),
glance = map(models, ~ glance(.x)),
tidied = map(models, ~ tidy(.x))) %>%
unnest(glance) %>%
select(NAICS:N, r.squared, tidied) %>%
unnest(tidied)
# A tibble: 2,024 x 9
# Groups: NAICS, Year [506]
NAICS Year N r.squared term estimate std.error statistic p.value
<dbl> <dbl> <int> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 315 2010 12 0.122 (Intercept) 0.0959 0.0123 7.83 0.0000143
2 315 2010 12 0.122 ELI 0.0189 0.0160 1.18 0.266
3 315 2010 12 0.122 Pandemic NA NA NA NA
4 315 2010 12 0.122 ELI:Pandemic NA NA NA NA
5 315 2011 12 0.129 (Intercept) 0.0999 0.0115 8.70 0.00000559
6 315 2011 12 0.129 ELI 0.0161 0.0132 1.22 0.251
7 315 2011 12 0.129 Pandemic NA NA NA NA
8 315 2011 12 0.129 ELI:Pandemic NA NA NA NA
9 315 2012 13 0.594 (Intercept) -0.486 0.606 -0.802 0.439
10 315 2012 13 0.594 ELI 2.11 0.526 4.01 0.00205
# ... with 2,014 more rows
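The NA rows for Pandemic and ELI:Pandemic in the output above are presumably because Pandemic does not vary within a single NAICS-by-Year group, so lm() cannot estimate those terms. If only N and R-squared per group are needed, the idea can also be sketched in base R. The data below is simulated, and the model is reduced to ROS ~ ELI for that reason; `toy`, `groups`, and `result` are names made up for this sketch.

```r
set.seed(1)

# Simulated data: 2 industries x 2 years x 10 firms
toy <- data.frame(
  NAICS = rep(c(311, 315), each = 20),
  Year  = rep(rep(2019:2020, each = 10), times = 2),
  ELI   = runif(40)
)
toy$ROS <- 0.1 + 0.2 * toy$ELI + rnorm(40, sd = 0.05)

# One data frame per NAICS-Year combination
groups <- split(toy, list(toy$NAICS, toy$Year), drop = TRUE)

# Fit one model per group and keep N and R-squared
fits <- lapply(groups, function(d) {
  m <- lm(ROS ~ ELI, data = d)
  data.frame(NAICS = d$NAICS[1], Year = d$Year[1],
             N = nrow(d), r.squared = summary(m)$r.squared)
})

result <- do.call(rbind, fits)
result
```

The resulting data frame has one row per industry and year and can be written out with write.csv() if a single file is needed.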
I have a large panel dataset of roughly 4 million daily observations (Overview of my Dataset).
The variable symbol depicts the 952 different stocks contained in the data set and the other variables are some stock-related daily measures. I want to calculate the weekly averages of the variables rv, rskew, rkurt and rsj for each of the 952 stocks included in symbol.
I tried to group the dataset with group_by(symbol), but then I did not manage to aggregate the daily observations in the right way.
I am not very experienced with R and would highly appreciate some help here.
This is simple with the lubridate and dplyr packages:
library(dplyr)
library(lubridate)
set.seed(123)
df <- data.frame(date = seq.Date(ymd('2020-07-01'),ymd('2020-07-31'),by='day'),
symbol = 'a',
x = runif(31),
y = runif(31),
z = runif(31)
)
df <- df %>%
mutate(year = year(date),
week = week(date),
) %>%
group_by(year, week, symbol) %>%
summarise(x = mean(x),
y = mean(y),
z = mean(z)
)
> df
# A tibble: 5 x 6
# Groups: year, week [5]
year week symbol x y z
<dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 2020 27 a 0.555 0.552 0.620
2 2020 28 a 0.652 0.292 0.461
3 2020 29 a 0.495 0.350 0.398
4 2020 30 a 0.690 0.495 0.609
5 2020 31 a 0.466 0.378 0.376
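For comparison, here is a base R sketch of the same weekly aggregation on simulated data with two symbols. It uses format(date, "%Y-%U") as the week key, which numbers weeks starting on Sunday and therefore will not necessarily match lubridate::week(); the column names rv and rskew are taken from the question, while the values are random.

```r
set.seed(123)

# Simulated daily data: 14 days for each of two symbols
daily <- data.frame(
  date   = rep(seq(as.Date("2020-07-01"), as.Date("2020-07-14"), by = "day"),
               times = 2),
  symbol = rep(c("a", "b"), each = 14),
  rv     = runif(28),
  rskew  = runif(28)
)

# Year-week key; %U counts weeks starting on Sunday
daily$week <- format(daily$date, "%Y-%U")

# Mean of each measure per symbol and week
weekly <- aggregate(cbind(rv, rskew) ~ symbol + week, data = daily, FUN = mean)
weekly
```

Including the year in the key matters for multi-year data, since week numbers repeat every year.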
I am not very experienced with loops so I am not sure where I went wrong here...
I have a dataframe that looks like:
month year day mean.temp mean.temp.year.month
1 1961 1 4.85 4.090323
1 1961 2 4.90 4.090323
1 1961 3 2.95 4.090323
1 1961 4 3.40 4.090323
1 1961 5 2.90 4.090323
dataset showing 3 months for 2 years can be found here:
https://drive.google.com/file/d/1w7NVeoEh8b7cAkU3cu1sXx6yCh75Inqg/view?usp=sharing
and I want to subset this dataframe by year and month so that I can run one nls model per year and month. Since my dataset contains 56 years (and each year has 12 months), that will give 672 models. Then I want to store the parameter estimates in a separate table.
I've created this code, but I can't work out why it is only giving me the parameter estimates for month 12 (all 56 years, but just month 12):
table <- matrix(99999, nrow=672, ncol=4)
YEARMONTHsel <- unique(df_weather[c("year", "month")])
YEARsel <- unique(df_weather$year)
MONTHsel <- unique(df_weather$month)
for (i in 1:length(YEARsel)) {
for (j in 1:length(MONTHsel)) {
temp2 <- df_weather[df_weather$year==YEARsel[i] & df_weather$month==MONTHsel[j],]
mn <- nls(mean.temp~mean.temp.year.month+alpha*sin(day*pi*2/30+phi),
data = temp2, control=nlc,
start=list(alpha=-6.07043, phi = -10))
cr <- as.vector(coef(mn))
nv <-length(coef(mn))
table[i,1:nv] <- cr
table[i,nv+1]<- YEARsel[i]
table[i,nv+2]<- MONTHsel[j]
}
}
I've tried several options (i.e. without using nested loop) but I'm not getting anywhere.
Any help would be greatly appreciated! Thanks.
Based on your loop, it looks like you want to run the regression grouped by year and month and then extract the coefficients into a new dataframe (correct me if that's wrong):
library(readxl)
library(tidyverse)
df <- read_excel("~/Downloads/df_weather.xlsx")
df %>% nest(-month, -year) %>%
mutate(model = map(data, ~nls(mean.temp~mean.temp.year.month+alpha*sin(day*pi*2/30+phi),
data = .x, control = nls.control(),  # defaults; substitute your own nlc settings
start=list(alpha=-6.07043, phi = -10))),
coeff = map(model, ~coefficients(.x))) %>%
unnest(coeff %>% map(broom::tidy)) %>%
spread(names, x) %>%
arrange(year)
#> # A tibble: 6 x 4
#> month year alpha phi
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1961 0.561 -10.8
#> 2 2 1961 -1.50 -10.5
#> 3 3 1961 -2.06 -9.77
#> 4 1 1962 -3.35 -5.48
#> 5 2 1962 -2.27 -9.97
#> 6 3 1962 0.959 -10.8
First we nest the data based on your groups (in this case year and month), then we map the model for each group, then we map the coefficients for each group, lastly we unnest the coefficients and spread the data from long to wide.
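As for why the original loop only kept month 12: every write goes to table[i, ...], where i is the year counter, so each month overwrites the previous month's row and only the last month survives. A minimal sketch of a loop that gives each year-month combination its own row is shown below. It uses simulated data and lm() as a stand-in for the nls() call so that it runs on its own; `toy`, `combos`, and `out` are names invented here.

```r
set.seed(1)

# Simulated data: 10 days x 3 months x 2 years
toy <- expand.grid(day = 1:10, month = 1:3, year = 1961:1962)
toy$mean.temp <- 5 + 0.1 * toy$day + rnorm(nrow(toy), sd = 0.2)

# One row per year-month combination
combos <- unique(toy[c("year", "month")])

out <- matrix(NA_real_, nrow = nrow(combos), ncol = 4,
              dimnames = list(NULL, c("intercept", "slope", "year", "month")))

# k indexes the combination, so every model gets its own row
for (k in seq_len(nrow(combos))) {
  sel <- toy$year == combos$year[k] & toy$month == combos$month[k]
  fit <- lm(mean.temp ~ day, data = toy[sel, ])  # stand-in for nls()
  out[k, ] <- c(coef(fit), combos$year[k], combos$month[k])
}
out
```

In the nested-loop version from the question, the equivalent fix would be a row index that depends on both i and j, for example a counter incremented inside the inner loop.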