Creating vector of all individual correlation [duplicate] - r

I am trying to create a vector of the correlations between variables for each cross-sectional unit. My dplyr approach returns an error saying that the variables need to be numeric, and I don't know how to solve it.
What I need to end up with is a data frame that contains the correlation between the crmrte variable and all other explanatory variables, but computed at the cross-sectional (per-county) level.
I need to generalize the code below:
cors <- crime %>%
  group_by(county) %>%
  summarize(cor = cor(crmrte, prbarr))
Update:
As suggested by Sotos, I generalized the code above so that it covers the other columns automatically:
cors <- crime %>%
  group_by(county) %>%
  summarise_at(vars(names(crime)[4:ncol(crime)]), funs(cor(crmrte, .)))
But I am not sure whether this is the right approach.

You can use summarise_at() along with vars(), which automatically quotes the names:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise_at(vars(names(mtcars)[6:10]), funs(cor(mpg, .)))
which gives:
# A tibble: 3 x 6
    cyl     wt   qsec      vs     am     gear
  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>    <dbl>
1     4 -0.713 -0.236  0.0488 0.536   0.339
2     6 -0.682 -0.419 -0.530  0.530  -0.00949
3     8 -0.650 -0.104 NA      0.0496  0.0496
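In current dplyr, funs() is deprecated and summarise_at() is superseded by across(). A minimal sketch of the same idea on mtcars, assuming dplyr 1.0 or later; the where(is.numeric) selector is my addition, aimed at the "variables need to be numeric" error from the question. Expect a warning for vs in the 8-cylinder group, where vs is constant and the correlation is NA.

```r
library(dplyr)

# Correlate mpg with every other numeric column, separately per cyl group.
# The grouping column (cyl) is excluded from across() automatically.
cors <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(where(is.numeric) & !mpg, ~ cor(mpg, .x)))

cors
```

For the crime data, the equivalent would presumably be group_by(county) with crmrte in place of mpg, but I have not run it against that dataset.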

Related

Is there a way to "summarize_by_group" without having to group_by the whole data each time?

I have a data frame with numerous variables I can group by.
I write a new chunk every time:
df %>% group_by(variable) %>% summarize()
Yet when I make a boxplot, I do not have to do this. I can simply add the groups in the function:
boxplot(df$numericvariable ~ df$variable_I_want_to_group_by, data=df)
This allows me in Rmarkdown to write all the different group_by's in the same chunk and view all the plots created next to each other.
I would like to have the same kind of "group_by" built into the summarize call itself (or into another function from a different package that does the same thing).
Expanding on the idea of writing a custom function so that you can quickly try lots of groupings, use the ... dots.
f <- function(...) {
  mtcars %>%
    group_by(...) %>%
    summarise(mean = mean(disp), n = n())
}
f(cyl)
f(cyl, gear)
You can use base R's aggregate(), which has a formula interface similar to boxplot() (the \(x) lambda shorthand needs R 4.1 or later):
aggregate(disp ~ cyl, mtcars, \(x) c(mean = mean(x), n = length(x)))
#   cyl disp.mean   disp.n
# 1   4  105.1364  11.0000
# 2   6  183.3143   7.0000
# 3   8  353.1000  14.0000
which gives the same numbers as dplyr:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(mean = mean(disp), n = n())
# # A tibble: 3 × 3
#     cyl  mean     n
#   <dbl> <dbl> <int>
# 1     4  105.    11
# 2     6  183.     7
# 3     8  353.    14
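If you are on dplyr 1.1.0 or later, summarise() also accepts a .by argument, which is probably the closest thing to grouping "inside the function". A short sketch on mtcars (the object names by_cyl and by_cyl_gear are just for illustration):

```r
library(dplyr)

# Per-call grouping: no separate group_by() step, and the result is ungrouped
by_cyl <- mtcars %>%
  summarise(mean = mean(disp), n = n(), .by = cyl)

# Several grouping columns go in c()
by_cyl_gear <- mtcars %>%
  summarise(mean = mean(disp), n = n(), .by = c(cyl, gear))

by_cyl
by_cyl_gear
```

One difference to be aware of: with .by the groups come back in order of first appearance in the data rather than sorted, so the rows may be ordered differently from the group_by() result.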

dplyr: group_by, sum various columns, and apply a function based on grouped row sums?

I'm trying to use dplyr to summarize a dataframe of bird species abundance in forests which are fragmented to some degree.
The first column, percent_cover, has 4 possible values: 10, 25, 50, 75. Then there are ten columns of bird species counts: 'species1' through 'species10'.
I want to group by percent_cover, then sum the other columns and calculate these sums as a percentage of the 4 row sums.
To get to the column sums is easy enough:
%>% group_by(Percent_cover) %>% summarise_at(vars(contains("species")), sum)
...but what I need is sum/rowSum*100. It seems that some kind of 'rowwise' operation is needed.
Also, out of interest, why does the following not work?
%>% group_by(Percent_cover) %>% summarise_at(vars(contains("species")), sum*100)
At this point, it's tempting to go back to 'for' loops....or Excel pivot tables.
With dplyr, try the following: sum each species column per group, add the row sum as a helper column rs, then divide each species column by it.
library(dplyr)
df %>%
  group_by(Percent_cover) %>%
  summarise(across(contains("species"), sum)) %>%
  mutate(rs = rowSums(select(., contains("species")))) %>%
  mutate(across(contains("species"), ~ . / rs * 100)) -> result
result
For example, using mtcars:
mtcars %>%
  group_by(cyl) %>%
  summarise(across(disp:wt, sum)) %>%
  mutate(rs = rowSums(select(., disp:wt))) %>%
  mutate(across(disp:wt, ~ . / rs * 100))
#    cyl  disp    hp  drat    wt    rs
#  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1     4  54.2  42.6 2.10  1.18  2135.
#2     6  58.7  39.2 1.15  0.998 2186.
#3     8  62.0  36.7 0.567 0.702 7974.
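On the side question of why sum*100 does not work: sum is a function object, and R cannot multiply a function by a number, so the expression fails before dplyr ever sees it. Wrapping it in a lambda such as ~ sum(.x) * 100 should do what was intended. Here is a small sketch; the birds data frame is made-up toy data standing in for the real counts, not the asker's dataset.

```r
library(dplyr)

# Hypothetical toy data: two cover levels, three species columns
birds <- data.frame(
  Percent_cover = c(10, 10, 25, 25),
  species1 = c(1, 3, 2, 2),
  species2 = c(4, 0, 1, 5),
  species3 = c(5, 7, 0, 0)
)

# sum * 100 is an error in plain R, independent of dplyr
sum_times_100 <- tryCatch(sum * 100, error = function(e) conditionMessage(e))

# A lambda makes the multiplication apply to the result of sum()
scaled <- birds %>%
  group_by(Percent_cover) %>%
  summarise(across(contains("species"), ~ sum(.x) * 100))

# Column sums as a percentage of each row total
pct <- birds %>%
  group_by(Percent_cover) %>%
  summarise(across(contains("species"), sum)) %>%
  mutate(rs = rowSums(across(contains("species")))) %>%
  mutate(across(contains("species"), ~ .x / rs * 100)) %>%
  select(-rs)

pct
```

The helper column rs is computed before the division on purpose, so that every species column is divided by the same original row total.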

Use group size (`group_size`) in `summarise` in `dplyr` [duplicate]

I want to use the size of a group as part of a groupwise operation in dplyr::summarise.
For example, calculate the proportion of manuals by cylinder by grouping the cars data by cyl and dividing the number of manuals by the size of the group:
mtcars %>%
  group_by(cyl) %>%
  summarise(zz = sum(am) / group_size(.))
But (I think) because group_size() expects a grouped tbl_df and . is ungrouped here, this returns:
Error in mutate_impl(.data, dots) : basic_string::resize
Is there a way to do this?
You can use n() to get the number of rows in each group:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(zz = sum(am) / n())
#    cyl    zz
#  <dbl> <dbl>
#1  4.00 0.727
#2  6.00 0.429
#3  8.00 0.143
Since am is coded 0/1, this is just a group-wise mean:
mtcars %>%
  group_by(cyl) %>%
  summarise(zz = mean(am))
# A tibble: 3 x 2
#    cyl    zz
#  <dbl> <dbl>
#1     4 0.727
#2     6 0.429
#3     8 0.143
If we really need to use group_size(), we can nest the data and call it on each group's data frame:
library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(zz = map_dbl(data, ~ sum(.x$am) / group_size(.x))) %>%
  arrange(cyl) %>%
  select(-data)
# A tibble: 3 x 2
#    cyl    zz
#  <dbl> <dbl>
#1     4 0.727
#2     6 0.429
#3     8 0.143
Or using do():
mtcars %>%
  group_by(cyl) %>%
  do(data.frame(zz = sum(.$am) / group_size(.)))
# A tibble: 3 x 2
# Groups:   cyl [3]
#    cyl    zz
#  <dbl> <dbl>
#1     4 0.727
#2     6 0.429
#3     8 0.143
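As far as I can tell, the reason group_size() is awkward inside summarise() is that it works on the whole grouped data frame and returns the sizes of all groups at once, whereas n() gives the size of the current group only. A quick sketch of the difference (object names are just for illustration):

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl)

# One entry per group, in the order of the sorted group keys (4, 6, 8)
sizes <- group_size(grouped)

# n() is evaluated once per group, so it yields a single number each time
manual_share <- grouped %>%
  summarise(zz = sum(am) / n())

sizes
manual_share
```

So inside summarise(), n() is the per-group counterpart of group_size().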

Reshape results of summarize all efficiently [duplicate]

As part of my exploratory work I have built a function that provides a variety of metrics for each field in my dataset. I want to apply it to each column of my dataset.
library(tidyverse)
mtcars %>% summarise_all(., .funs = funs(mean, median, sd, max, min, n_distinct))
However, this results in a dataset with 1 row and one column for each function/column combination. The names are also concatenated like 'column_function'.
The desired result would be a 'tidy' format like:
ORIGINAL_COLUMN_NAME | FUNCTION | RESULT
I'm guessing there has to be an easy way to do this?
Here is one option.
library(tidyverse)
mtcars %>%
  gather(Original_Column, Value) %>%
  group_by(Original_Column) %>%
  summarise_all(., .funs = funs(mean, median, sd, max, min, n_distinct)) %>%
  gather(Function, Result, -Original_Column)
# # A tibble: 66 x 3
#    Original_Column Function  Result
#    <chr>           <chr>      <dbl>
#  1 am              mean       0.406
#  2 carb            mean       2.81
#  3 cyl             mean       6.19
#  4 disp            mean     231.
#  5 drat            mean       3.60
#  6 gear            mean       3.69
#  7 hp              mean     147.
#  8 mpg             mean      20.1
#  9 qsec            mean      17.8
# 10 vs              mean       0.438
# # ... with 56 more rows
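Since gather() and funs() are superseded or deprecated in newer tidyverse releases, here is a sketch of the same reshape with pivot_longer() and across(), assuming dplyr 1.0+ and tidyr 1.0+. The metrics list and the tidy_stats name are mine; the row order may differ from the output above.

```r
library(dplyr)
library(tidyr)

# Named list of summary functions; the names become the Function values
metrics <- list(mean = mean, median = median, sd = sd,
                max = max, min = min, n_distinct = n_distinct)

tidy_stats <- mtcars %>%
  # Long format first: one row per (column, value) pair
  pivot_longer(everything(), names_to = "Original_Column", values_to = "Value") %>%
  group_by(Original_Column) %>%
  # One output column per function, named after the function
  summarise(across(Value, metrics, .names = "{.fn}")) %>%
  # Long format again: one row per (column, function) pair
  pivot_longer(-Original_Column, names_to = "Function", values_to = "Result")

tidy_stats
```

This only works as-is because every column of mtcars is numeric; with mixed column types you would need to select the numeric columns first.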

dplyr: group mean centering (mutate + summarize)

What is the efficient/preferred way to do group mean centering with dplyr, that is, to take each element of a group (mutate) and combine it with a summary statistic for that group (summarize)? Here's how one might do group mean centering on mtcars using base R:
do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(x) {
  x[["cent"]] <- x$mpg - mean(x$mpg)
  x
}))
You can try:
library(dplyr)
mtcars %>%
  tibble::rownames_to_column() %>% # if the rownames are needed as a column
  group_by(cyl) %>%
  mutate(cent = mpg - mean(mpg))
It appears that the above code uses the global mean to center mpg. What should I do if I want to center at the within-group mean, given that the mean differs across the levels of cyl?
> mtcars %>%
+   add_rownames() %>% # if the rownames are needed as a column
+   group_by(cyl) %>%
+   mutate(cent = mpg - mean(mpg)) %>%
+   dplyr::select(cent)
Adding missing grouping variables: `cyl`
# A tibble: 32 x 2
# Groups:   cyl [3]
     cyl   cent
   <dbl>  <dbl>
 1     6  0.909
 2     6  0.909
 3     4  2.71
 4     6  1.31
 5     8 -1.39
 6     6 -1.99
 7     8 -5.79
 8     4  4.31
 9     4  2.71
10     6 -0.891
# … with 22 more rows
Warning message:
Deprecated, use tibble::rownames_to_column() instead.
> mtcars$mpg[1:5] - mean(mtcars$mpg)
[1]  0.909375  0.909375  2.709375  1.309375 -1.390625
You can try this instead (scale() returns a one-column matrix, so the new variable is displayed under a slightly different name):
mtcars %>%
  group_by(cyl) %>%
  mutate(gpcent = scale(mpg, scale = FALSE))
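If mutate() after group_by() gives you the global mean, one possible cause (my assumption; nothing in this thread confirms it) is that another package loaded after dplyr, such as plyr, is masking mutate(). Calling the dplyr functions with an explicit dplyr:: prefix sidesteps that, and base R's ave() gives an independent check of the result:

```r
library(dplyr)

centered <- mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::mutate(cent = mpg - mean(mpg)) %>%  # mean(mpg) is evaluated per cyl group
  dplyr::ungroup()

# ave() returns each row's group mean (its default FUN is mean)
base_cent <- mtcars$mpg - ave(mtcars$mpg, mtcars$cyl)

# Should be TRUE if the dplyr result is centered within groups
all.equal(centered$cent, base_cent)
```

If you prefer scale(), wrapping it as as.numeric(scale(mpg, scale = FALSE)) should give a plain numeric column instead of a one-column matrix.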
