How to calculate the sum of rows by each gvkey respectively?


I tried to calculate the cumulative sum of Twitter followers for each gvkey separately. I used group_by, but the output is still the cumulative sum over the entire column. I suspect the problem is the for loop, for (i in 1:nrow(predmod_e)):
predmod_e <- predmod_e %>%
  arrange(gvkey, date) %>%  # arrange by gvkey and date
  group_by(gvkey)           # use group_by for the per-gvkey calculation
for (i in 1:nrow(predmod_e)) {
  predmod_e[i+1,]$x <- predmod_e[i+1,]$x + predmod_e[i,]$x
}  # for loop to calculate

Perhaps just this:
predmod_e <- predmod_e %>%
arrange(gvkey, date) %>%
group_by(gvkey) %>%
mutate(newx = cumsum(x))
If you want to do something with the groups yourself (i.e., not with a dplyr verb), then you should use the groups as they are "known" by the tidy verbs. Luckily, they are merely stored as an attribute:
mtcars %>%
group_by(cyl) %>%
attr(., "groups")
# # A tibble: 3 x 2
# cyl .rows
# <dbl> <list>
# 1 4 <int [11]>
# 2 6 <int [7]>
# 3 8 <int [14]>
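As a quick check of the grouped cumsum above, here is a minimal sketch on made-up data. The column names gvkey, date and x follow the question; the values are invented purely for illustration.

```r
library(dplyr)

# Toy data: two firms (gvkey) with follower counts x; values are invented
predmod_e <- tibble(
  gvkey = c(1001, 1001, 1001, 2002, 2002),
  date  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03",
                    "2020-01-01", "2020-01-02")),
  x     = c(10, 5, 2, 100, 50)
)

res <- predmod_e %>%
  arrange(gvkey, date) %>%
  group_by(gvkey) %>%
  mutate(newx = cumsum(x)) %>%  # running total restarts for each gvkey
  ungroup()

res$newx
# 10 15 17 100 150

# Base R equivalent (the data must already be sorted by gvkey and date)
base_newx <- ave(predmod_e$x, predmod_e$gvkey, FUN = cumsum)
```

The for loop in the question indexes rows directly, so it ignores the grouping set by group_by; verbs such as mutate are what respect the groups.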

Related

Is there a way to "summarize_by_group" without having to group_by the whole data each time?

I have a data frame with numerous variables I can group by.
I write a new chunk every time:
df %>% group_by(variable) %>% summarize()
Yet when I make a boxplot, I do not have to do this. I can simply add the groups in the function:
boxplot(df$numericvariable ~ df$variable_I_want_to_group_by, data=df)
This allows me in Rmarkdown to write all the different group_by's in the same chunk and view all the plots created next to each other.
I would like to find the same built-in grouping as an integral part of a function for summarize (or another function from a different package that does the same).

Expanding on the idea of writing a custom function so that you can quickly try lots of groupings, use the ... dots.
library(dplyr)
f <- function(...) {
  mtcars %>%
    group_by(...) %>%   # any number of grouping columns
    summarise(mean = mean(disp), n = n())
}
f(cyl)
f(cyl, gear)

You may use base R aggregate with a similar formula interface to boxplot,
aggregate(disp ~ cyl, mtcars, \(x) c(mean=mean(x), n=length(x)))
# cyl disp.mean disp.n
# 1 4 105.1364 11.0000
# 2 6 183.3143 7.0000
# 3 8 353.1000 14.0000
which gives the same numbers as dplyr (note that aggregate stores mean and n together in a single matrix column named disp).
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(mean = mean(disp), n =n())
# # A tibble: 3 × 3
# cyl mean n
# <dbl> <dbl> <int>
# 1 4 105. 11
# 2 6 183. 7
# 3 8 353. 14
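If the installed dplyr is version 1.1.0 or later (an assumption; check with packageVersion("dplyr")), summarise() also accepts a per-call .by argument, which comes close to the "grouping as part of the function" that the question asks for. A minimal sketch:

```r
library(dplyr)

# Per-operation grouping: no separate group_by() step is needed
by_cyl <- mtcars %>%
  summarise(mean = mean(disp), n = n(), .by = cyl)

# Several grouping columns are passed as a vector
by_cyl_gear <- mtcars %>%
  summarise(mean = mean(disp), n = n(), .by = c(cyl, gear))
```

Unlike group_by(), .by returns the groups in order of first appearance rather than sorted, and the result is always ungrouped.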

Creating a dataframe of stats from another dataframe in R

I have the following code.
n_manu <- mpg %>% dplyr::select(manufacturer) %>% n_distinct()
n_model <- mpg %>% dplyr::select(model) %>% n_distinct()
n_year <- mpg %>% dplyr::select(year) %>% n_distinct()
I want to put this in a dataframe that looks like this:
stat value
n_manu 15
n_model 38
n_year 2
Is there a way to do this elegantly, without three separate lines of code for calculating the distinct values?

library(tidyverse)
mpg %>%
summarise(across(c(manufacturer, model, year), n_distinct))
Gives
# A tibble: 1 × 3
manufacturer model year
<int> <int> <int>
1 15 38 2
and
mpg %>%
summarise(across(c(manufacturer, model, year), n_distinct)) %>%
pivot_longer(everything(), names_to="stat")
# A tibble: 3 × 2
stat value
<chr> <int>
1 manufacturer 15
2 model 38
3 year 2
From there you can finesse "row labels" with ease.
To save the results as a dataframe, simply assign the result of the pipe to an object:
summaryStats <- mpg %>%
summarise(across(c(manufacturer, model, year), n_distinct)) %>%
pivot_longer(everything(), names_to="stat")
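One possible way to get the exact row labels from the question (n_manu, n_model, n_year) is to rename the summary columns before pivoting. This is only a sketch; it assumes ggplot2 is installed so that the mpg data set is available.

```r
library(dplyr)
library(tidyr)

summaryStats <- ggplot2::mpg %>%
  summarise(across(c(manufacturer, model, year), n_distinct)) %>%
  # rename the one-row summary so the labels match the question
  rename(n_manu = manufacturer, n_model = model, n_year = year) %>%
  pivot_longer(everything(), names_to = "stat")
```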

How do I create a function to mutate new columns with a variable name and "_pct"?

Using mtcars as an example. I would like to write a function that creates a count and pct column such as below -
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(cyl_pct = count/sum(count))
This produces the output -
# A tibble: 3 x 3
cyl count cyl_pct
<dbl> <int> <dbl>
1 4 11 0.344
2 6 7 0.219
3 8 14 0.438
However, I would like to create a function where I can specify any column as the group_by column, and the mutated column will be named after the column specified in group_by, followed by _pct. So if I want to use disp, disp will be my group_by variable and the function will mutate a disp_pct column.

Similar to akrun's answer, but using {{ instead of !!:
foo = function(data, col) {
data %>%
group_by({{col}}) %>%
summarize(count = n()) %>%
ungroup %>%
mutate(
"{{col}}_pct" := count / sum(count)
)
}
foo(mtcars, cyl)
# `summarise()` ungrouping output (override with `.groups` argument)
# # A tibble: 3 x 3
# cyl count cyl_pct
# <dbl> <int> <dbl>
# 1 4 11 0.344
# 2 6 7 0.219
# 3 8 14 0.438

Assuming that the input is unquoted, convert it to a symbol with ensym, evaluate it (!!) within group_by, convert the symbol into a string (as_string), and paste on the suffix '_pct' for the new column name. In mutate we can use := along with !! to assign the column name from the object created ('colnm').
library(stringr)
library(dplyr)
f1 <- function(dat, grp) {
grp <- ensym(grp)
colnm <- str_c(rlang::as_string(grp), '_pct')
dat %>%
group_by(!!grp) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(!! colnm := count/sum(count))
}
-testing
f1(mtcars, cyl)
# A tibble: 3 x 3
# cyl count cyl_pct
# <dbl> <int> <dbl>
#1 4 11 0.344
#2 6 7 0.219
#3 8 14 0.438

This is probably no different from the one posted by my dear friend @akrun. However, in my version I used the enquo function instead of ensym.
There is actually a subtle difference between the two, and I thought you might be interested to know:
As per the documentation of nse-defuse, ensym returns a raw expression whereas enquo returns a "quosure", which is in fact a "wrapper containing an expression and an environment". So we need one extra step to access the expression of the quosure made by enquo.
In this case we use get_expr for that purpose. So here is just another version of this function that I thought might be of interest to whoever reads this post in the future.
library(dplyr)
library(rlang)
fn <- function(data, Var) {
Var <- enquo(Var)
colnm <- paste(get_expr(Var), "pct", sep = "_")
data %>%
group_by(!!Var) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(!! colnm := count/sum(count))
}
fn(mtcars, cyl)
# A tibble: 3 x 3
cyl count cyl_pct
<dbl> <int> <dbl>
1 4 11 0.344
2 6 7 0.219
3 8 14 0.438
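For comparison, the three answers above can be condensed by letting count() do the grouping, counting and ungrouping in one step. This is a sketch rather than a replacement for any of them; the function name pct_by is hypothetical, and it assumes the data has no existing column called count.

```r
library(dplyr)

# pct_by is a hypothetical helper name, not part of dplyr
pct_by <- function(data, col) {
  data %>%
    count({{ col }}, name = "count") %>%           # one row per group, ungrouped
    mutate("{{ col }}_pct" := count / sum(count))  # e.g. cyl -> cyl_pct
}

out <- pct_by(mtcars, cyl)
names(out)
# "cyl" "count" "cyl_pct"
```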

Apply a custom function over levels of a factor in a dataframe

I'm trying to apply a tidyverse-based approach, or at least a tidy solution, for applying custom functions over the levels of a factor in a dataframe.
Consider the following test dataset:
df <- tibble(LINE=rep(c(1,2),each=6), FOUND=c(1,1,1,0,1,1,0,0,1,0,0,1))
# LINE FOUND
# <dbl> <dbl>
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 0
# 5 1 1
# 6 1 1
# 7 2 0
# 8 2 0
# 9 2 1
#10 2 0
#11 2 0
#12 2 1
I want to know, for example, the proportion of found results (e.g. FOUND==1) by level of the LINE factor. Right now I'm working with the following code, but I'm really trying to get to something cleaner.
# This is the function to calculate the proportion "found"
get_prop <- function (data) {
tot <- data %>% nrow()
found <- data %>% dplyr::filter(FOUND==1) %>% nrow
found / tot
}
# This is the code to generate the expected result
lines <- df$LINE %>% unique %>% sort
v_line <- vector()
v_prop <- vector()
for (i in 1:length(lines)) {
tot <- df %>% dplyr::filter(LINE==lines[i])
v_line[i] <- lines[i]
v_prop[i] <- get_prop(tot)
}
df_line = data.frame(LINE = v_line, CALL = v_prop)
I would expect the following to work, but it does not: it returns one row per level, but the value in each row is computed from the whole dataset rather than from that level alone:
df %>% dplyr::group_by(LINE) %>% dplyr::summarise(get_prop(.))
EDIT: Please note that what I am looking for is a solution for applying a custom function over the levels of a factor in a dataframe. It is not necessarily the number or the proportion of occurrences of a particular value, as in the example illustrated.
EDIT 2: That is, I'm looking for a solution that makes use of the get_prop function above. This is not because it is the best way of solving this particular issue, but because it is more generalizable

If you want to apply a custom function group-wise, you can use group_split. This splits your data frame into a list, where each element is the subset of df for one group. You can then use map to apply your function to each level (note that you can group_split and map in one step by using group_map). I added the last line to get back to the form of the original approach.
df %>%
group_by(LINE) %>%
group_split() %>%
map_dbl(get_prop) %>%
tibble(LINE = seq_along(.), CALL = .) # optional to get back to a df
#> # A tibble: 2 x 2
#> LINE CALL
#> <int> <dbl>
#> 1 1 0.833
#> 2 2 0.333
Created on 2020-01-20 by the reprex package (v0.3.0)
One thing that worries me about this solution is that group_split does not label the list elements with the grouping variable (I would have preferred it to be kept as the names of the list or as an attribute). So if you want a tibble as the outcome, it might make sense to save the grouping variable beforehand:
groups <- unique(df$LINE)
df %>%
group_by(LINE) %>%
group_split() %>%
map_dbl(get_prop) %>%
tibble(group = groups, result = .)
Update
I think the overall cleanest approach would be this (using a more general example):
library(tidyverse)
df <- tibble(LINE=rep(c("a", "b"),each=6), FOUND=c(1,1,1,0,1,1,0,0,1,0,0,1))
lvls <- unique(df$LINE)
df %>%
group_by(LINE) %>%
group_map(~ get_prop(.x)) %>%
setNames(lvls) %>%
unlist() %>%
enframe()
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 a 0.833
#> 2 b 0.333
Created on 2020-01-20 by the reprex package (v0.3.0)

Another option could be to use group_map and then tibble::enframe
library(dplyr)
df %>%
group_by(LINE) %>%
group_map(~get_prop(.)) %>%
unlist() %>%
tibble::enframe()
# name value
# <int> <dbl>
#1 1 0.833
#2 2 0.333
You could also use group_modify, which would keep the group names (using @JBGruber's data)
df %>%
group_by(LINE) %>%
group_modify(~ tibble::enframe(get_prop(.), name = NULL))
# LINE value
# <chr> <dbl>
#1 a 0.833
#2 b 0.333
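The summarise() attempt in the question fails because the `.` placeholder always refers to the whole piped data frame, not to the current group. Assuming dplyr 1.1.0 or later, one way to keep both get_prop and summarise() is to hand the function the current group's rows via pick(everything()). A minimal sketch, with get_prop restated from the question so that the block is self-contained:

```r
library(dplyr)

df <- tibble(LINE  = rep(c(1, 2), each = 6),
             FOUND = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1))

# Proportion of rows with FOUND == 1 (same logic as in the question)
get_prop <- function(data) {
  tot   <- nrow(data)
  found <- nrow(dplyr::filter(data, FOUND == 1))
  found / tot
}

# pick(everything()) is a tibble holding only the current group's rows
res <- df %>%
  group_by(LINE) %>%
  summarise(CALL = get_prop(pick(everything())))
```

On older dplyr versions cur_data() played the same role, but it has been deprecated in favour of pick().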

Add grouping variable for nested tibbles

This is a follow-up to this question.
I need to be able to group_by() columns in my new nested table. I can't find a purrr function that does this (although I know a solution exists). I need to group_by in each table to apply additional summarizing functions and fit linear models appropriately. The example here is just a dummy example.
library(tidyverse)
set.seed(2)
N <- 30
df <- tibble(type = rep(c("small","medium","high"), each=N/3),
dummy = rep(c(1,5,10),each=10),
xvals = rep(1:10,3),
A = rnorm(N)*dummy,
B = rnorm(N)*dummy,
C = rnorm(N)*dummy) %>%
mutate(type = factor(type, levels=c("small","medium","high"))) %>%
select(-dummy) %>%
pivot_longer(cols=-c(type,xvals), names_to="metric", values_to = "value") %>%
group_by(type) %>%
group_nest(.key="data")
This produces a tibble with two columns:
df
# A tibble: 3 x 2
type data
<fct> <list>
1 small <tibble [30 x 3]>
2 medium <tibble [30 x 3]>
3 high <tibble [30 x 3]>
This is an example of what I want to do across all the nested tibbles:
df[[2]][[1]] %>%
group_by(metric) %>%
summarize(mean = mean(value))
# A tibble: 3 x 2
metric mean
<chr> <dbl>
1 A 0.211
2 B -0.296
3 C -0.391

After the group_nest, the 'data' is a list column of tibbles and there are only two columns 'type' and 'data'. If we need to create a grouping based on the list column, loop through the list with map and then do the group_by
library(dplyr)
library(tidyr)
library(purrr)
df %>%
mutate(data = map(data, ~ .x %>%
group_by(metric) %>%
summarize(mean = mean(value)))) -> out
out$data[[1]]
# A tibble: 3 x 2
# metric mean
# <chr> <dbl>
#1 A 0.115
#2 B 0.323
#3 C -0.326
NOTE: The output values will differ from those in the question, as no seed was set when this was run.
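If a single flat table is more convenient than a list column of summaries, the mapped result can be unnested again. The sketch below uses a small stand-in for the nested tibble in the question (the name nested and the random values are assumptions for illustration only):

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(2)
# Stand-in for the question's data: 3 types x 3 metrics, 2 values each
nested <- tibble(type   = rep(c("small", "medium", "high"), each = 6),
                 metric = rep(c("A", "B", "C"), times = 6),
                 value  = rnorm(18)) %>%
  group_by(type) %>%
  group_nest(.key = "data")

flat <- nested %>%
  mutate(data = map(data, ~ .x %>%
                      group_by(metric) %>%
                      summarise(mean = mean(value)))) %>%
  unnest(data)  # one row per type/metric pair
```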
