Using R & dplyr to summarize - group_by, count, mean, sd [closed]

I am fairly new to R and even newer to dplyr. I have a small data set comprised of 2 columns - var1 and var2. The var1 column is comprised of num values. The var2 column is comprised of factors with 3 levels - A, B, and C.
var1 var2
1 1.4395244 A
2 1.7698225 A
3 3.5587083 A
4 2.0705084 A
5 2.1292877 A
6 3.7150650 B
7 2.4609162 B
8 0.7349388 B
9 1.3131471 B
10 1.5543380 B
11 3.2240818 C
12 2.3598138 C
13 2.4007715 C
14 2.1106827 C
15 1.4441589 C
'data.frame': 15 obs. of 2 variables:
$ var1: num 1.44 1.77 3.56 2.07 2.13 ...
$ var2: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 2 2 2 2 2 ...
I am trying to use dplyr to group_by var2 (A, B, and C) then count, and summarize the var1 by mean and sd. The count works but rather than provide the mean and sd for each group, I receive the overall mean and sd next to each group.
To try to resolve the issue, I have conducted multiple internet searches. All results seem to offer a similar syntax to the one I am using. I have also read through all of the recommended posts that Stack Overflow offered prior to posting. Also, I tried restarting R and I made sure that I am not using plyr.
Here is the code that I used to create the data set and the dplyr group_by / summarize.
library(dplyr)
set.seed(123)
var1 <- rnorm(15, mean=2, sd=1)
var2 <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C")
df <- data.frame(var1, var2)
df
df %>%
group_by(df$var2) %>%
summarize(
count = n(),
mean = mean(df$var1, na.rm = TRUE),
sd = sd(df$var1, na.rm = TRUE)
)
Here are the results:
# A tibble: 3 x 4
`df$var2` count mean sd
<fct> <int> <dbl> <dbl>
1 A 5 2.15 0.845
2 B 5 2.15 0.845
3 C 5 2.15 0.845
The count appears to work showing a count of 5 for each group. Each group is showing the overall mean and sd for the whole column rather than each group. The expected results are the count, mean, and sd for each group.
I am sure I am overlooking something obvious but I would greatly appreciate any assistance.

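Even though this was answered in the comments, I felt such a nice reproducible example for a very first question deserved an official answer. The fix is to refer to the columns by their bare names (var1, var2) inside group_by() and summarize(). Writing df$var1 pulls the whole column from the original data frame and ignores the grouping, which is why every group showed the overall mean and sd.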
library(dplyr)
set.seed(123)
var1 <- rnorm(15, mean=2, sd=1)
var2 <- c(rep("A", 5), rep("B", 5), rep("C", 5))
df <- data.frame(var1, var2)
df_stat <- df %>% group_by(var2) %>% summarize(
count = n(),
mean = mean(var1, na.rm = TRUE),
sd = sd(var1, na.rm = TRUE))
head(df_stat)
# A tibble: 3 x 4
# var2 count mean sd
# <fct> <int> <dbl> <dbl>
# 1 A 5 2.19 0.811
# 2 B 5 1.96 1.16
# 3 C 5 2.31 0.639
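To make the difference concrete, here is a minimal sketch that computes the group means both ways and cross-checks the corrected version against base R's tapply(). The names wrong, right, and base_means are only illustrative, and the data are rebuilt with rep(..., each = 5), which gives the same var2 column as above.

```r
library(dplyr)

set.seed(123)
df <- data.frame(
  var1 = rnorm(15, mean = 2, sd = 1),
  var2 = rep(c("A", "B", "C"), each = 5)
)

# df$var1 always refers to the full, ungrouped column,
# so every group receives the same overall mean.
wrong <- df %>%
  group_by(var2) %>%
  summarize(mean = mean(df$var1))

# A bare column name is evaluated separately within each group.
right <- df %>%
  group_by(var2) %>%
  summarize(mean = mean(var1))

# Cross-check the per-group means against base R.
base_means <- tapply(df$var1, df$var2, mean)
```

All three rows of wrong carry one repeated value, while right should agree with base_means group by group.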

Related

Multiply numbers from different data frames based on all the possible combinations

I have 5 data frames like the ones below:
df_mon <- data.frame(mon = as.factor(c(6, 7, 8, 9, 10)),
number = c(1.11, 1.02, 0.95, 0.92, 0.72))
df_year <- data.frame(year = as.factor(c(1, 2)),
number = c(1.61, 0.4))
df_cat <- data.frame(cat = c("A", "B", "C"),
number = c(1.11, 1.02, 0.44))
df_bin <- data.frame(bin = as.factor(c(1, 2)),
number = c(1.42, 0.56))
df_cat2 <- data.frame(cat2 = c("A", "B", "C", "D", "AA"),
number = c(0.11, 1.22, 1.34, 0.88, 0.75))
I need to multiply all the numbers in the 'number' columns from each of these data frames with each other. That is, take every possible combination of the values in the first column of each data set and multiply the corresponding numbers. The final results data frame should look something like this (the first 3 rows are done):
results_df <- data.frame(combi = c("mon6_year1_catA_bin1_cat2A", "mon6_year1_catA_bin1_cat2B", "mon6_year1_catA_bin1_cat2C"),
final_number = c(1.11*1.61*1.11*1.42*0.11, 1.11*1.61*1.11*1.42*1.22, 1.11*1.61*1.11*1.42*1.34))
The first column in results_df shows which combination was used to calculate the final_number. In the first example, the 'number' value for category 6 (1.11) from df_mon is multiplied by the following:
category 1 (1.61) from df_year
category A (1.11) from df_cat
category 1 (1.42) from df_bin
category A (0.11) from df_cat2
The answer for this combination is 1.11 x 1.61 x 1.11 x 1.42 x 0.11 = 0.3098.
The 2nd row shows the next possible combination and so on.
I'm not sure how to achieve this, so any help will be greatly appreciated!

Maybe you can try expand.grid like below
lst <- list(df_mon, df_year, df_cat, df_bin, df_cat2)
results_df <- data.frame(
combi = do.call(
paste,
c(do.call(
expand.grid,
lapply(lst, function(v) paste0(names(v[1]), v[, 1]))
), sep = "_")
),
final_number = Reduce(
"*",
do.call(
expand.grid,
lapply(lst, `[[`, 2)
)
)
)
which gives
> head(results_df)
combi final_number
1 mon6_year1_catA_bin1_cat2A 0.30985097
2 mon7_year1_catA_bin1_cat2A 0.28472792
3 mon8_year1_catA_bin1_cat2A 0.26518777
4 mon9_year1_catA_bin1_cat2A 0.25681342
5 mon10_year1_catA_bin1_cat2A 0.20098441
6 mon6_year2_catA_bin1_cat2A 0.07698161
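The same idea may be easier to follow on a reduced case. The sketch below uses two small, made-up data frames (a and b are hypothetical, not part of the question) and builds the labels and the products with two separate expand.grid() calls. Note that expand.grid() varies its first argument fastest, which is why the output above steps through the months before changing the year.

```r
# Two hypothetical lookup tables, each with a key column and a number column.
a <- data.frame(mon = c(6, 7), number = c(2, 3))
b <- data.frame(year = c(1, 2), number = c(10, 100))

# All combinations of the labels; the first argument varies fastest.
labels <- expand.grid(
  paste0("mon", a$mon),
  paste0("year", b$year),
  stringsAsFactors = FALSE
)

# All combinations of the numbers, generated in the same order as the labels.
values <- expand.grid(a$number, b$number)

result <- data.frame(
  combi = paste(labels$Var1, labels$Var2, sep = "_"),
  final_number = values$Var1 * values$Var2
)
result
```

With more than two tables, Reduce("*", ...) over the columns of the value grid plays the role of the single multiplication here.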

Here is an approach using dplyr and tidyr.
library(dplyr)
library(tidyr)
df_all <- df_mon %>%
full_join(df_year, by = character()) %>% # by = character() requests a cross join; dplyr >= 1.1.0 provides cross_join() for this
full_join(df_cat, by = character()) %>%
full_join(df_bin, by = character()) %>%
full_join(df_cat2, by = character()) %>%
pivot_longer(cols = c(-mon, -year, -cat, -bin, -cat2)) %>%
group_by(mon, year, cat, bin, cat2) %>%
summarize(final_number = prod(value), .groups = "keep")
# A tibble: 300 x 6
# Groups: mon, year, cat, bin, cat2 [300]
mon year cat bin cat2 final_number
<fct> <fct> <chr> <fct> <chr> <dbl>
1 6 1 A 1 A 0.310
2 6 1 A 1 AA 2.11
3 6 1 A 1 B 3.44
4 6 1 A 1 C 3.77
5 6 1 A 1 D 2.48
6 6 1 A 2 A 0.122
7 6 1 A 2 AA 0.833
8 6 1 A 2 B 1.36
9 6 1 A 2 C 1.49
10 6 1 A 2 D 0.978
# ... with 290 more rows
It keeps the variables from the other data.frames intact as columns for further analysis, but you could create your combi column with a little paste().
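As a rough sketch of that last step, the snippet below builds the combi label with paste0(). To stay self-contained it uses a two-row stand-in for the joined table (df_all_demo is a hypothetical name, with values copied from rows 1 and 3 of the output above) rather than the full 300-row result.

```r
library(dplyr)

# Small stand-in for the joined and summarized table.
df_all_demo <- data.frame(
  mon = c("6", "6"),
  year = c("1", "1"),
  cat = c("A", "A"),
  bin = c("1", "1"),
  cat2 = c("A", "B"),
  final_number = c(0.310, 3.44)
)

# Prefix each key with its column name and glue the pieces with underscores.
df_combi <- df_all_demo %>%
  mutate(combi = paste0(
    "mon", mon, "_year", year, "_cat", cat, "_bin", bin, "_cat2", cat2
  ))
df_combi$combi
```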

Rank a dataframe based on multiple conditions [duplicate]

Suppose I have the following data
df = data.frame(name=c("A", "B", "C", "D"), score = c(10, 10, 9, 8))
I want to add a new column with the ranking. This is what I'm doing:
df %>% mutate(ranking = rank(score, ties.method = 'first'))
# name score ranking
# 1 A 10 3
# 2 B 10 4
# 3 C 9 2
# 4 D 8 1
However, my desired result is:
# name score ranking
# 1 A 10 1
# 2 B 10 1
# 3 C 9 2
# 4 D 8 3
Clearly rank does not do what I have in mind. What function should I be using?

It sounds like you're looking for dense_rank from "dplyr", applied in descending order (the reverse of what rank does by default).
Try this:
df %>% mutate(rank = dense_rank(desc(score)))
# name score rank
# 1 A 10 1
# 2 B 10 1
# 3 C 9 2
# 4 D 8 3
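For comparison, here is a small sketch of how the common ranking variants treat the tie in this data. The base R one-liner using match() is just one possible equivalent of dense_rank(desc(x)), shown for illustration.

```r
library(dplyr)

score <- c(10, 10, 9, 8)

# "min" ranking leaves a gap after the tie: 1 1 3 4
gap_rank_base <- rank(-score, ties.method = "min")
gap_rank_dplyr <- min_rank(desc(score))

# Dense ranking leaves no gap: 1 1 2 3
dense_dplyr <- dense_rank(desc(score))

# Base R equivalent: position of each score among the sorted unique scores.
dense_base <- match(score, sort(unique(score), decreasing = TRUE))
```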

Another solution, for when you need to rank every variable rather than just one:
df = data.frame(name = c("A","B","C","D"),
score=c(10,10,9,8), score2 = c(5,1,9,2))
# across() replaces the deprecated mutate_all(funs(...)) idiom (requires dplyr >= 1.0.0)
select(df, -name) %>% mutate(across(everything(), ~ dense_rank(desc(.x))))

@user101089, you can try this alternative:
df = data.frame(name = c("A","B","C","D"),
score=c(10,10,9,8), score2 = c(5,1,9,2))
df %>% mutate(rank_score = dense_rank(desc(score)),
rank_score2 = dense_rank(desc(score2)))

Subset by two factors variables [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(8 answers)
Closed 4 years ago.
I'd like to aggregate my dataset by the interaction between two factors (fac1, fac2) and apply a function to each combination. For example, consider the dataset given by
set.seed(1)
test <- data.frame(fac1 = sample(c("A", "B", "C"), 30, replace = TRUE),
fac2 = sample(c("a", "b"), 30, replace = TRUE),
value = runif(30))
For fac1 == "A" and fac2 == "a" we have five values, and I'd like to aggregate them by min. Using brute force, I tried it this way:
min(test[test$fac1 == "A" & test$fac2 == "a", ]$value)

You mention aggregate and that will work here.
aggregate(test$value, test[,1:2], min)
fac1 fac2 x
1 A a 0.32535215
2 B a 0.14330438
3 C a 0.33239467
4 A b 0.33907294
5 B b 0.08424691
6 C b 0.24548851

Here is a tidyverse alternative
test %>% group_by(fac1, fac2) %>% summarise(x = min(value))
## A tibble: 6 x 3
## Groups: fac1 [?]
# fac1 fac2 x
# <fct> <fct> <dbl>
#1 A a 0.325
#2 A b 0.339
#3 B a 0.143
#4 B b 0.0842
#5 C a 0.332
#6 C b 0.245
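As a sanity check on either approach, the sketch below compares the brute-force subset from the question with a grouped minimum, using aggregate()'s formula interface. It runs on a small hand-written data frame (test_small is a hypothetical stand-in) so the expected minimum is easy to verify by eye.

```r
# Small deterministic stand-in for the simulated data.
test_small <- data.frame(
  fac1 = c("A", "A", "A", "B", "B", "B"),
  fac2 = c("a", "a", "b", "a", "b", "b"),
  value = c(0.5, 0.2, 0.9, 0.7, 0.4, 0.1)
)

# Formula interface: minimum of value for every fac1 x fac2 combination.
agg <- aggregate(value ~ fac1 + fac2, data = test_small, FUN = min)

# Brute-force minimum for one combination, as in the question.
brute <- min(test_small$value[test_small$fac1 == "A" & test_small$fac2 == "a"])

# The matching row of the aggregated table should hold the same value.
agg_Aa <- agg$value[agg$fac1 == "A" & agg$fac2 == "a"]
```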

dplyr mutate: Excluding observations similar to the current one

I have some data like this:
X Y
-----
A 1
A 2
B 3
B 4
C 5
C 6
I would like to add a new column with values equal to the mean of all Ys in rows where X is not equal to the X of the current observation.
In this particular case we would get
X Y Mean
-------------------
A 1 (3+4+5+6)/4
A 2 (3+4+5+6)/4
B 3 (1+2+5+6)/4
B 4 (1+2+5+6)/4
C 5 (1+2+3+4)/4
C 6 (1+2+3+4)/4
Thanks in advance!

You can likely do this more succinctly, but this will get you the result.
You first create columns holding the sum of Y and the number of observations for the whole data.frame. Then you group by the X column and compute the same two quantities per group. Subtracting the group values from the totals gives the sum and count of all the other groups, and their ratio is the mean you want.
data
df <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
Y = c(1:6))
solution
library(tidyverse)
df %>%
mutate(total_sum = sum(Y),
total_obs = n()) %>%
group_by(X) %>%
mutate(group_sum = sum(Y),
group_obs = n()) %>%
ungroup() %>%
mutate(other_group_sum = total_sum - group_sum,
other_group_obs = total_obs - group_obs,
other_mean = other_group_sum/other_group_obs) %>%
select(X, Y, other_mean)
result
# A tibble: 6 x 3
X Y other_mean
<fct> <int> <dbl>
1 A 1 4.50
2 A 2 4.50
3 B 3 3.50
4 B 4 3.50
5 C 5 2.50
6 C 6 2.50
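The same subtraction trick can be sketched more compactly in base R with ave(), which returns a per-group statistic aligned to the original rows. This is only an alternative illustration of the idea above, not a replacement for the dplyr pipeline.

```r
df <- data.frame(
  X = c("A", "A", "B", "B", "C", "C"),
  Y = 1:6
)

# Per-row sum and count of Y within the row's own group.
group_sum <- ave(df$Y, df$X, FUN = sum)
group_obs <- ave(df$Y, df$X, FUN = length)

# Mean of Y over all rows that belong to a different group.
df$other_mean <- (sum(df$Y) - group_sum) / (nrow(df) - group_obs)
df
```

For group A this gives (21 - 3) / (6 - 2) = 4.5, matching the result above.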

For each observation, find a corresponding centile on a subset determined by factor

Assume I have a data frame like so:
df<-data.frame(f=rep(c("a", "b", "c", "d"), 100), value=rnorm(400))
I want to create a new column, which will contain a centile that an observation belongs to, calculated separately on each factor level.
What would be a reasonably simple and efficient way to do that? The closest I came to a solution was
df$newColumn<-findInterval(df$value, tapply(df$value, df$f, quantile, probs=seq(0, 0.99, 0.01))$df[, "f"])
However, this just gives zeros to all observations. The tapply returns a four-element list of quantile vectors and I'm not sure how to access a relevant element for each observation to pass as an argument for the findInterval function.
The number of rows in the data frame could reach a few millions, so speed is an issue too. The factor column will always have four levels.

With dplyr:
library(dplyr)
df %>%
group_by(f) %>%
mutate(quant = findInterval(value, quantile(value)))
#> Source: local data frame [400 x 3]
#> Groups: f [4]
#>
#> f value quant
#> <fctr> <dbl> <int>
#> 1 a 0.51184061 3
#> 2 b 0.44362348 3
#> 3 c -1.04869448 1
#> 4 d -2.41772425 1
#> 5 a 0.10738332 3
#> 6 b -0.58630348 1
#> 7 c 0.34376820 3
#> 8 d 0.68322738 4
#> 9 a 1.00232314 4
#> 10 b 0.05499391 3
#> # ... with 390 more rows
With data.table:
library(data.table)
dt <- setDT(df)
dt[, quant := findInterval(value, quantile(value)), by = f]
dt
#> f value quant
#> 1: a 0.3608395 3
#> 2: b -0.1028948 2
#> 3: c -2.1903336 1
#> 4: d 0.7470262 4
#> 5: a 0.5292031 3
#> ---
#> 396: d -1.3475332 1
#> 397: a 0.1598605 3
#> 398: b -0.4261003 2
#> 399: c 0.3951650 3
#> 400: d -1.4409000 1
Data:
df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))
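Note that quantile(value) with its default probs returns the quartile boundaries, so the quant column above runs from 1 to 5 (the group maximum falls on the last breakpoint and gets 5). If centiles are wanted, as in the question, one possible adjustment is to pass the probs from the original attempt. The sketch below assumes that reading; set.seed(42) and the names df_centile and centile are illustrative only.

```r
library(dplyr)

set.seed(42)
df <- data.frame(f = rep(c("a", "b", "c", "d"), 100), value = rnorm(400))

# 100 breakpoints (0%, 1%, ..., 99%) per group, so findInterval()
# returns a centile number between 1 and 100 for every observation.
df_centile <- df %>%
  group_by(f) %>%
  mutate(centile = findInterval(
    value,
    quantile(value, probs = seq(0, 0.99, 0.01))
  )) %>%
  ungroup()
```

The same change to the probs argument applies to the data.table version.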

I think that data.table is faster; however, here is a solution that does not use any packages.
Define a function based on cut (or findInterval) together with quantile:
cut2 <- function(x){
# labels = FALSE returns the integer bin index (1 to 100) instead of a factor
cut( x , breaks=quantile(x, probs = seq(0, 1, 0.01)) , include.lowest=TRUE , labels=FALSE)
}
Then apply it by factor level using ave (note that the column is named value, not values):
df$newColumn <- ave(df$value, df$f, FUN=cut2)
