Create dataframe with correlation and p-value by group? - r

I am trying to correlate several variables according to a specific group (COUNTY) in R. Although I am able to successfully find the correlation for each column through this method, I can't seem to find a way to save the p-value to the table for each group. Any suggestions?
Example Data:
crops <- data.frame(
  COUNTY = sample(37001:37900),
  CropYield = sample(c(1:100), 10, replace = TRUE),
  MaxTemp = sample(c(40:80), 10, replace = TRUE),
  precip = sample(c(0:10), 10, replace = TRUE),
  ColdDays = sample(c(1:73), 10, replace = TRUE))
Example Code:
crops %>%
  group_by(COUNTY) %>%
  do(data.frame(Cor = t(cor(.[, 2:5], .[, 2]))))
This gives me the correlation for each column, but I need to know the p-value for each one as well. Ideally the final output would look like this:
Desired Output

You only have one observation per COUNTY, so it will not work. I set up more observations per COUNTY:
set.seed(111)
crops <- data.frame(
  COUNTY = sample(37001:37002, 10, replace = TRUE),
  CropYield = sample(c(1:100), 10, replace = TRUE),
  MaxTemp = sample(c(40:80), 10, replace = TRUE),
  precip = sample(c(0:10), 10, replace = TRUE),
  ColdDays = sample(c(1:73), 10, replace = TRUE))
I think you need to convert to a long format and run a cor.test() per COUNTY and variable:
library(dplyr)
library(tidyr)

calcor = function(da){
  data.frame(cor.test(da$CropYield, da$value)[c("estimate", "p.value")])
}

crops %>%
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  do(calcor(.))
# A tibble: 6 x 4
# Groups: COUNTY, name [6]
COUNTY name estimate p.value
<int> <chr> <dbl> <dbl>
1 37001 ColdDays 0.466 0.292
2 37001 MaxTemp -0.225 0.628
3 37001 precip -0.356 0.433
4 37002 ColdDays 0.888 0.304
5 37002 MaxTemp 0.941 0.220
6 37002 precip -0.489 0.674
The above gives you correlation for every variable against crop yield, for every county. Now it's a matter of converting it into wide format:
crops %>%
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  do(calcor(.)) %>%
  pivot_wider(values_from = c(estimate, p.value), names_from = name)
COUNTY estimate_ColdDa… estimate_MaxTemp estimate_precip p.value_ColdDays
<int> <dbl> <dbl> <dbl> <dbl>
1 37001 0.466 -0.225 -0.356 0.292
2 37002 0.888 0.941 -0.489 0.304
# … with 2 more variables: p.value_MaxTemp <dbl>, p.value_precip <dbl>
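As a side note, do() has since been superseded in dplyr. A sketch of the same per-group cor.test() with group_modify() (assuming dplyr >= 1.0 and the calcor() helper above), which applies a function to each group's data and row-binds the results:

crops %>%
  pivot_longer(-c(COUNTY, CropYield)) %>%
  group_by(COUNTY, name) %>%
  group_modify(~ calcor(.x)) %>%   # .x is the data for one COUNTY/name group
  pivot_wider(values_from = c(estimate, p.value), names_from = name)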

Related

Reordering rows alphabetically with specific exception(s) in R

This is my first Stack Overflow question so bear with me please. I'm trying to create dataframes that are ordered alphabetically based on a "Variable" field, with exceptions made for rows of particular values (e.g. "Avg. Temp" at the top of the dataframe and "Intercept" at the bottom of the dataframe). The starting dataframe might look like this, for example:
Variable Model 1 Estimate
Year=2009 0.026
Year=2010 -0.04
Year=2011 -0.135***
Age 0.033***
Avg Temp. -0.001***
Intercept -3.772***
Sex -0.073***
Year=2008 0.084***
Year=2012 -0.237***
Year=2013 -0.326***
Year=2014 -0.431***
Year=2015 -0.589***
And I want to reorder it as such:
Variable Model 1 Estimate
Avg Temp. -0.001***
Age 0.033***
Sex -0.073***
Year=2008 0.084***
Year=2009 0.026
Year=2010 -0.04
Year=2011 -0.135***
Year=2012 -0.237***
Year=2013 -0.326***
Year=2014 -0.431***
Year=2015 -0.589***
Intercept -3.772***
Appreciate any help on this.
You can use the fct_relevel() function from {forcats}. The first call puts Avg Temp., Age and Sex at the beginning (after = 0 by default). The second call puts Intercept at the end (n() refers to the number of rows in the data frame).
library(tidyverse)

df <- tribble(
  ~Variable,   ~Model,
  "Year=2009",  0.026,
  "Year=2010", -0.04,
  "Year=2011", -0.135,
  "Age",        0.033,
  "Avg Temp.", -0.001,
  "Intercept", -3.772,
  "Sex",       -0.073,
  "Year=2008",  0.084,
  "Year=2012", -0.237,
  "Year=2013", -0.326,
  "Year=2014", -0.431,
  "Year=2015", -0.589)

df %>%
  mutate(Variable = as.factor(Variable),
         Variable = fct_relevel(Variable, "Avg Temp.", "Age", "Sex"),
         Variable = fct_relevel(Variable, "Intercept", after = n())) %>%
  arrange(Variable)
# A tibble: 12 × 2
Variable Model
<fct> <dbl>
1 Avg Temp. -0.001
2 Age 0.033
3 Sex -0.073
4 Year=2008 0.084
5 Year=2009 0.026
6 Year=2010 -0.04
7 Year=2011 -0.135
8 Year=2012 -0.237
9 Year=2013 -0.326
10 Year=2014 -0.431
11 Year=2015 -0.589
12 Intercept -3.77
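A small variant worth knowing (assuming a current forcats): fct_relevel() also accepts after = Inf to mean "move to the end", which avoids the n() call:

df %>%
  mutate(Variable = as.factor(Variable),
         Variable = fct_relevel(Variable, "Avg Temp.", "Age", "Sex"),
         Variable = fct_relevel(Variable, "Intercept", after = Inf)) %>%
  arrange(Variable)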
Another option, in case the dataframes contain a variety of different variable names besides year and intercept, is something like this:
library(tidyverse)
# Sample data
df <- tribble(
~variable, ~model_1_estimate,
"Year=2009", "0.026",
"Year=2010", "-0.04",
"Year=2011", "-0.135***",
"Age", "0.033***",
"Avg Temp.", "-0.001***",
"Intercept", "-3.772***",
"Sex", "-0.073***",
"Year=2008", "0.084***",
"Year=2012", "-0.237***",
"Year=2013", "-0.326***",
"Year=2014", "-0.431***",
"Year=2015", "-0.589***"
)
# Possible solution
df |>
  separate(variable, c("term", "year"), sep = "=") |>
  mutate(intercept = if_else(term == "Intercept", 1, 0)) |>
  arrange(intercept, term, year) |>
  select(-intercept)
#> # A tibble: 12 × 3
#> term year model_1_estimate
#> <chr> <chr> <chr>
#> 1 Age <NA> 0.033***
#> 2 Avg Temp. <NA> -0.001***
#> 3 Sex <NA> -0.073***
#> 4 Year 2008 0.084***
#> 5 Year 2009 0.026
#> 6 Year 2010 -0.04
#> 7 Year 2011 -0.135***
#> 8 Year 2012 -0.237***
#> 9 Year 2013 -0.326***
#> 10 Year 2014 -0.431***
#> 11 Year 2015 -0.589***
#> 12 Intercept <NA> -3.772***
Created on 2022-06-28 by the reprex package (v2.0.1)
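If the original labels are needed again after the separate() step, tidyr's unite() can rejoin the two columns; with na.rm = TRUE the missing year part is simply dropped for rows like Age or Intercept (a sketch, assuming tidyr >= 1.0):

df |>
  separate(variable, c("term", "year"), sep = "=") |>
  mutate(intercept = if_else(term == "Intercept", 1, 0)) |>
  arrange(intercept, term, year) |>
  select(-intercept) |>
  unite("variable", term, year, sep = "=", na.rm = TRUE)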

Choose dataframe variables by name and multiply with a vector elementwise

I have a data frame and a vector as follows:
my_df <- as.data.frame(
list(year = c(2001, 2001, 2001, 2001, 2001, 2001), month = c(1,
2, 3, 4, 5, 6), Pdt_d0 = c(0.379045935402736, 0.377328817455841,
0.341158889847019, 0.36761990427443, 0.372442657083218, 0.382702189949558
), Pdt_d1 = c(0.146034519173855, 0.166289573095497, 0.197787188740911,
0.137071647982617, 0.162103042313547, 0.168566518193772), Pdt_d2 = c(0.126975939811326,
0.107708783271871, 0.14096203677089, 0.142228236885706, 0.115542396064519,
0.106935751726809), Pdt_tot = c(2846715, 2897849.5, 2935406.25,
2850649, 2840313.75, 3087993.5))
)
my_vec <- 1:3
I want to multiply Pdt_d0:Pdt_d2 with the corresponding element from my_vec, while keeping the other columns untouched. I can get the desired multiplication with dplyr::select(my_df, num_range("Pdt_d", 0:2)) %>% mapply(`*`, ., my_vec), but I lose the year, month, and Pdt_tot columns in the process. I tried to achieve my goal with dplyr::select(my_df, num_range("Pdt_d", 0:2)) <- dplyr::select(my_df, num_range("Pdt_d", 0:2)) %>% mapply(`*`, ., my_vec), which returns the error 'select<-' is not an exported object. Is there an obvious trick I am not seeing?
I don't think my question is a duplicate; I have seen the answers here and here, but neither question allows me to choose variables by name.
You can use the Map/mapply logic you tried, overwriting a left-hand-side selection with the right-hand side, outside of the tidy world:
vars <- paste0("Pdt_d", 0:2)
my_df[vars] <- Map(`*`, my_df[vars], my_vec)
my_df
# year month Pdt_d0 Pdt_d1 Pdt_d2 Pdt_tot
#1 2001 1 0.3790459 0.2920690 0.3809278 2846715
#2 2001 2 0.3773288 0.3325791 0.3231263 2897850
#3 2001 3 0.3411589 0.3955744 0.4228861 2935406
#4 2001 4 0.3676199 0.2741433 0.4266847 2850649
#5 2001 5 0.3724427 0.3242061 0.3466272 2840314
#6 2001 6 0.3827022 0.3371330 0.3208073 3087994
This works because [<- exists as a replacement function in R for assigning into a selection made with square brackets, like my_df[vars].
The error you saw occurs because the code has a select() call on the left-hand side, and there is no 'select<-' function; you can't assign to a select()-ion because it isn't set up to work like that. The tidy functions are usually expected to be piped, like my_df %>% select() %>% etc., without overwriting the original input.
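A quick way to see the difference (purely illustrative, run at the top level; it assumes nothing on your search path defines a select<- of its own): the replacement function behind [ exists, while dplyr exports no replacement version of select():

exists("[<-")       # TRUE  -- assigning to my_df[vars] uses this replacement function
exists("select<-")  # FALSE -- hence "'select<-' is not an exported object"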
I don't think that you want to do this mess, but it does work.
library(dplyr)
library(tidyr)

my_df %>%
  gather(variable, value, -year, -month, -Pdt_tot) %>%
  group_by(year, month, Pdt_tot) %>%
  mutate(value = value * my_vec) %>%
  spread(variable, value)
year month Pdt_tot Pdt_d0 Pdt_d1 Pdt_d2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 1 2846715 0.379 0.292 0.381
2 2001 2 2897850. 0.377 0.333 0.323
3 2001 3 2935406. 0.341 0.396 0.423
4 2001 4 2850649 0.368 0.274 0.427
5 2001 5 2840314. 0.372 0.324 0.347
6 2001 6 3087994. 0.383 0.337 0.321
Without spelling out year, month, and Pdt_tot, it is:
my_df %>%
  gather(variable, value, -!num_range("Pdt_d", 0:2)) %>%
  group_by(across(c(-variable, -value))) %>%
  mutate(value = value * my_vec) %>%
  spread(variable, value)
year month Pdt_tot Pdt_d0 Pdt_d1 Pdt_d2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2001 1 2846715 0.379 0.292 0.381
2 2001 2 2897850. 0.377 0.333 0.323
3 2001 3 2935406. 0.341 0.396 0.423
4 2001 4 2850649 0.368 0.274 0.427
5 2001 5 2840314. 0.372 0.324 0.347
6 2001 6 3087994. 0.383 0.337 0.321
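For completeness, a more current sketch (it assumes dplyr >= 1.0, where across() and cur_column() exist) that multiplies the matched columns in place without reshaping:

library(dplyr)

my_df %>%
  mutate(across(num_range("Pdt_d", 0:2),
                ~ .x * my_vec[match(cur_column(), paste0("Pdt_d", 0:2))]))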

Weekly average of daily panel data in R

I have a large panel dataset of roughly 4million daily observations (Overview of my Dataset).
The variable symbol identifies the 952 different stocks contained in the data set, and the other variables are stock-related daily measures. I want to calculate the weekly averages of the variables rv, rskew, rkurt and rsj for each of the 952 stocks included in symbol.
I tried to group the dataset with group_by(symbol), but then I did not manage to aggregate the daily observations in the right way.
I am not very experienced with R and would highly appreciate some help here.
This is simple with the lubridate and dplyr packages:
library(dplyr)
library(lubridate)

set.seed(123)
df <- data.frame(date = seq.Date(ymd('2020-07-01'), ymd('2020-07-31'), by = 'day'),
                 symbol = 'a',
                 x = runif(31),
                 y = runif(31),
                 z = runif(31))

df <- df %>%
  mutate(year = year(date),
         week = week(date)) %>%
  group_by(year, week, symbol) %>%
  summarise(x = mean(x),
            y = mean(y),
            z = mean(z))
> df
# A tibble: 5 x 6
# Groups: year, week [5]
year week symbol x y z
<dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 2020 27 a 0.555 0.552 0.620
2 2020 28 a 0.652 0.292 0.461
3 2020 29 a 0.495 0.350 0.398
4 2020 30 a 0.690 0.495 0.609
5 2020 31 a 0.466 0.378 0.376
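Applied to the panel described in the question, the same idea looks roughly like this (only a sketch: it assumes your data frame, here called panel, has the columns symbol, date, rv, rskew, rkurt and rsj). floor_date() keys each week by its starting date, and across() avoids writing one mean() per column:

library(dplyr)
library(lubridate)

weekly <- panel %>%   # panel = your full data frame (hypothetical name)
  group_by(symbol, week = floor_date(date, unit = "week")) %>%
  summarise(across(c(rv, rskew, rkurt, rsj), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")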

How to compare technical duplicates on separate rows in R?

I would like to compare the mean, sd, and percentage CV of two technical duplicates in R.
Currently my data frame looks like this:
library(tidyverse)
data <- tribble(
~rowname, ~Sample, ~Phagocytic_Score,
1, 1232, 24030,
2, 1232, 11040,
3, 4321, 7266,
4, 4321, 4096,
5, 5631, 7383,
6, 5631, 21507
)
Created on 2019-10-22 by the reprex package (v0.3.0)
So I would want to compare the values from rows 1 and 2 together, 3 and 4 and so on.
With ideally this being stored in a new data frame just with the average score and stats if that makes sense.
Sorry, I'm quite new to R, so apologies if this is really straightforward.
Thanks! Mari
summarize() can give you exactly this, especially if all the stats you want are computed within groups defined by one variable, i.e. Sample:
library(raster)
#> Loading required package: sp
library(tidyverse)
data <- tribble(
~rowname, ~Sample, ~Phagocytic_Score,
1, 1232, 24030,
2, 1232, 11040,
3, 4321, 7266,
4, 4321, 4096,
5, 5631, 7383,
6, 5631, 21507
)
data %>%
  group_by(Sample) %>%
  summarize(
    mean = mean(Phagocytic_Score),
    sd = sd(Phagocytic_Score),
    pct_cv = cv(Phagocytic_Score)
  )
#> # A tibble: 3 x 4
#> Sample mean sd pct_cv
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1232 17535 9185. 52.4
#> 2 4321 5681 2242. 39.5
#> 3 5631 14445 9987. 69.1
We've got some repeating going on, though, don't we? Each variable is defined as a function call with the same input variable. summarize_at() is more appropriate, then:
data %>%
  group_by(Sample) %>%
  summarize_at("Phagocytic_Score",
               list(mean = mean, sd = sd, cv = cv))
#> # A tibble: 3 x 4
#> Sample mean sd cv
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1232 17535 9185. 52.4
#> 2 4321 5681 2242. 39.5
#> 3 5631 14445 9987. 69.1
Ah, but there's still some more room for improvement. Why are we repeating the names of the functions as the names of the variables, since they're the same? Well, mget() will take a single vector of the function names we want, and return a named list of those functions, with the names as those function names:
data %>%
  group_by(Sample) %>%
  summarize_at("Phagocytic_Score",
               mget(c("mean", "sd", "cv"), inherits = TRUE))
#> # A tibble: 3 x 4
#> Sample mean sd cv
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1232 17535 9185. 52.4
#> 2 4321 5681 2242. 39.5
#> 3 5631 14445 9987. 69.1
Note we need inherits = TRUE for the reason explained here.
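To see why inherits = TRUE matters (an illustrative check, run at the top level with raster attached): those functions live in attached packages, not in the environment mget() searches by default.

mget(c("mean", "sd", "cv"))                  # error: value for 'mean' not found
mget(c("mean", "sd", "cv"), inherits = TRUE) # also searches enclosing environments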
Created on 2019-10-22 by the reprex package (v0.3.0)
If I'm understanding your question, you are looking to summarize your dataframe by grouping based on one of the columns. I assume that in your real data you don't always have exactly two observations of each of your samples.
This approach uses the tidyverse packages; there are other ways to accomplish the same thing.
library(tidyverse)

df %>%                 # name of your data frame
  group_by(Sample) %>% # puts all observations with the same value of "Sample" into groups for the subsequent summary
  summarize(Mean = mean(Phagocytic_Score),
            SD = sd(Phagocytic_Score),
            PercentCV = SD / Mean # uses the SD and Mean just calculated for each group; multiply by 100 for a percentage
  )

filtering data based on rank and conditions

I have some data which looks similar to the following:
# A tibble: 2,717 x 6
# Groups: date [60]
symbol date monthly.returns score totals score_rank
<chr> <date> <dbl> <dbl> <dbl> <int>
1 GIS 2010-01-29 0.0128 0.436 119. 2
2 GIS 2010-02-26 0.00982 0.205 120. 1
3 GIS 2010-03-31 -0.0169 0.549 51.1 3
4 GIS 2010-04-30 0.0123 0.860 28.0 4
5 GIS 2010-05-28 0.000984 0.888 91.6 4
6 GIS 2010-06-30 -0.00267 0.828 15.5 4
7 GIS 2010-07-30 -0.0297 0.482 81.7 2
8 GIS 2010-08-31 0.0573 0.408 57.2 3
9 GIS 2010-09-30 0.0105 0.887 93.3 4
10 GIS 2010-10-29 0.0357 0.111 96.6 1
# ... with 2,707 more rows
I have a score_rank column. Whenever the totals column is > 100, I want to filter the data in the following way:
1) When the score_rank = 1, take the top 5% of observations based on the score column
2) When the score_rank = 2 or 3, take a random sample of 5% of the observations
3) When the score_rank = 4, take the bottom 5% of observations based on the score column.
Data:
tickers <- c("GIS", "KR", "MKC", "SJM", "EL", "HRL", "HSY", "K",
"KMB", "MDLZ", "MNST", "PEP", "PG", "PM", "SYY", "TAP", "TSN", "WBA", "WMT",
"MMM", "ABMD", "ACN", "AMD", "AES", "AON", "ANTM", "APA", "CSCO", "CMS", "KO", "GRMN", "GPS",
"JEC", "SJM", "JPM", "JNPR", "KSU", "KEYS", "KIM", "NBL", "NEM", "NWL", "NFLX", "NEE", "NOC", "TMO", "TXN", "TWTR")
library(tidyquant)

data <- tq_get(tickers,
               get = "stock.prices",        # collect the stock price data from 2010 - 2015
               from = "2010-01-01",
               to = "2015-01-01") %>%
  group_by(symbol) %>%
  tq_transmute(select = adjusted,           # convert the daily prices to monthly returns
               mutate_fun = periodReturn,
               period = "monthly",
               type = "arithmetic")

data$score <- runif(nrow(data), min = 0, max = 1)
data$totals <- runif(nrow(data), min = 10, max = 150)

data <- data %>%
  group_by(date) %>%
  mutate(score_rank = ntile(score, 4))
Edit: Added code.
Here is one option to filter. Create a list of functions (fs), one per 'score_rank' rule, and use map2 to loop over that list together with the corresponding list of 'score_rank' vectors. In each iteration, filter 'data' to rows where 'totals' is greater than 100 and 'score_rank' is %in% the current vector, then apply the matching function to the 'score' column to pick the rows. Finally, bind the result with the rows where 'totals' is less than or equal to 100.
library(purrr)
library(dplyr)

fs <- list(as_mapper(~ . >= quantile(., prob = 0.95)),
           as_mapper(~ row_number() %in% sample(row_number(), round(0.05 * n()))),
           as_mapper(~ . <= quantile(., prob = 0.05)))

map2_df(list(1, c(2, 3), 4), fs, ~
  data %>%
    filter(totals > 100, score_rank %in% .x) %>%
    filter(.y(score))
) %>%
  bind_rows(data %>% filter(totals <= 100))
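A rougher sketch of the same selection using dplyr's slice helpers (assumes dplyr >= 1.0; the object names eligible, picked and out are just illustrative). Because data is still grouped by date, the 5% cut-offs are taken within each date, roughly mirroring the per-group quantile() behaviour above:

library(dplyr)

eligible <- data %>% filter(totals > 100)

picked <- bind_rows(
  eligible %>% filter(score_rank == 1) %>% slice_max(score, prop = 0.05),
  eligible %>% filter(score_rank %in% c(2, 3)) %>% slice_sample(prop = 0.05),
  eligible %>% filter(score_rank == 4) %>% slice_min(score, prop = 0.05)
)

out <- bind_rows(picked, data %>% filter(totals <= 100))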
