I have a question about for loops in R. I used the following loop
for (i in 1:length(unique(iris$Species))) {
datu <- data.frame(ID = unique(i),
Sl = mean(iris$Sepal.Length),
Sw = mean(iris$Sepal.Width))
}
to get the mean for each unique species in iris, but the final data frame has only one observation. My desired output has a separate row for setosa, versicolor, and virginica. What should I change in this code? Thanks.
We don't need a loop here. It can be done with a group-by approach:
setNames(aggregate(.~ Species, iris[c(1, 2, 5)], mean), c("ID", "Sl", "Sw"))
-output
ID Sl Sw
1 setosa 5.006 3.428
2 versicolor 5.936 2.770
3 virginica 6.588 2.974
Or with tidyverse
library(dplyr)
library(stringr)
iris %>%
group_by(ID = Species) %>%
summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE),
.names = "{str_to_title(str_remove_all(.col, '[a-z.]+'))}"))
-output
# A tibble: 3 × 3
ID Sl Sw
<fct> <dbl> <dbl>
1 setosa 5.01 3.43
2 versicolor 5.94 2.77
3 virginica 6.59 2.97
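In the loop, unique(i) is just i; what was meant is unique(iris$Species)[i]. The means are also computed over all rows instead of only the rows of the current species. In addition, datu is overwritten in each iteration, so only the result of the last iteration is kept. Instead, the results can be stored in a list and combined with rbind afterwards, or appended to datu inside the loop: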
datu <- data.frame()
for (i in 1:length(unique(iris$Species))) {
unqSp <- unique(iris$Species)[i]
i1 <- iris$Species == unqSp
datu <- rbind(datu, data.frame(ID = unqSp,
Sl = mean(iris$Sepal.Length[i1]),
Sw = mean(iris$Sepal.Width[i1])))
}
-output
> datu
ID Sl Sw
1 setosa 5.006 3.428
2 versicolor 5.936 2.770
3 virginica 6.588 2.974
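The list-then-rbind alternative mentioned above avoids growing datu inside the loop. A minimal sketch, where the names species, res and keep are only illustrative:

```r
# Collect one small data frame per species in a preallocated list
species <- unique(iris$Species)
res <- vector("list", length(species))

for (i in seq_along(species)) {
  keep <- iris$Species == species[i]          # rows of the current species
  res[[i]] <- data.frame(ID = species[i],
                         Sl = mean(iris$Sepal.Length[keep]),
                         Sw = mean(iris$Sepal.Width[keep]))
}

# Bind the pieces together once, after the loop
datu <- do.call(rbind, res)
datu
```

Using seq_along() instead of 1:length() also keeps the loop safe when the vector being iterated over is empty.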
A tidyverse approach using dplyr.
dplyr::summarize
‘summarise()’ creates a new data frame. It will have one (or more)
rows for each combination of grouping variables; if there are no
grouping variables, the output will have a single row summarising
all observations in the input.
library(dplyr)
iris %>%
group_by(Species) %>%
summarize(Sl = mean(Sepal.Length), Sw = mean(Sepal.Width))
# A tibble: 3 × 3
Species Sl Sw
<fct> <dbl> <dbl>
1 setosa 5.01 3.43
2 versicolor 5.94 2.77
3 virginica 6.59 2.97
Related
I have the following data frame:
df<- splitstackshape::stratified(iris, group="Species", size=1)
I want to make a z-score for each species including all of the variables. I can do this manually by finding the SD and mean for each row and using the appropriate formula, but I need to do this several times over and would like to find a more efficient way.
I tried using scale(), but can't figure out how to get it to do the row-wise calculation that includes several variables and a grouping variable.
Using dplyr::group_by returns a "'x' must be numeric variable" error.
Are you sure you want one z-score per group? A z-score is defined for each value.
Let's say the function used to compute the z-score is either:
scale(x, center = TRUE, scale = TRUE)
Or
function_zscore = function(x){return((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))}
Both take a numeric vector and return one z-score per element (scale() returns a one-column matrix, hence the [, 1] used with it).
library(dplyr)
df<- splitstackshape::stratified(iris, group="Species", size=1)
df <- tidyr::pivot_longer(df, cols = c(1:4), names_to = "var.name", values_to = "value")
df %>%
group_by(Species) %>%
mutate(zscore = scale(value, center = TRUE, scale = TRUE)[,1])
## A tibble: 12 x 4
## Groups: Species [3]
# Species var.name value zscore
# <fct> <chr> <dbl> <dbl>
# 1 setosa Sepal.Length 4.9 1.22
# 2 setosa Sepal.Width 3.1 0.332
# 3 setosa Petal.Length 1.5 -0.455
# 4 setosa Petal.Width 0.2 -1.09
# 5 versicolor Sepal.Length 5.9 1.10
# 6 versicolor Sepal.Width 3.2 -0.403
# 7 versicolor Petal.Length 4.8 0.486
# 8 versicolor Petal.Width 1.8 -1.18
# 9 virginica Sepal.Length 6.5 1.14
#10 virginica Sepal.Width 3 -0.574
#11 virginica Petal.Length 5.2 0.501
#12 virginica Petal.Width 2 -1.06
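As a sanity check on the setosa rows above, the following sketch computes the z-scores by hand for those four values and compares them with scale(). The vector x simply copies the values printed in the table; stratified() samples a random row, so a fresh run will generally produce different numbers.

```r
# The four setosa values from the printed output above
x <- c(4.9, 3.1, 1.5, 0.2)

# z-score by definition: distance from the mean in units of standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; [, 1] turns it back into a vector
z_scale <- scale(x, center = TRUE, scale = TRUE)[, 1]

all.equal(z_manual, z_scale)
round(z_manual, 3)
```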
If we still want a single score per group that describes how the sample varies around its mean, one possible solution is the coefficient of variation:
df %>%
group_by(Species) %>%
summarise(coef.var = 100*sd(value)/mean(value))
## A tibble: 3 x 2
# Species coef.var
# <fct> <dbl>
#1 setosa 83.8
#2 versicolor 45.8
#3 virginica 49.0
I want to aggregate by group, keeping the continuous columns as rows and the levels of the categorical factor as column headers, with the aggregated value being the mean, min, or max. This is a basic question, but I have not been able to figure out the answer. Take the iris data as an example: I want the mean of Sepal.Width and Sepal.Length for every Species category.
library(dplyr)
mydata2 <-iris
# Groupby function for dataframe in R
summarise_at(group_by(mydata2,Species),vars(Sepal.Length),funs(mean(.,na.rm=TRUE)))
OUTPUT
Species Sepal.Length
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
I want the same output, but with Sepal.Length as a row instead of Species, and the levels of Species as columns. I also want Sepal.Width, Petal.Length, and Petal.Width. How can I do that?
This is what I am looking for -
Species setosa versicolor virginica
1 Sepal.Length 5.01 5.94 6.59
Below this there should be Sepal.Width and other continuous columns as well.
I tried transposing, but that converts everything to the character data type.
One option to achieve your desired result would be to reshape your data after summarise via e.g. pivot_longer and pivot_wider. If you do that often you could put the code into a convenience function to do that in one step:
Note: I also dropped the summarise_at and switched to the new API using across and where.
library(dplyr)
library(tidyr)
summarise(group_by(iris, Species), across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) %>%
pivot_longer(-Species, names_to = "var") %>%
pivot_wider(names_from = Species, values_from = value)
#> # A tibble: 4 × 4
#> var setosa versicolor virginica
#> <chr> <dbl> <dbl> <dbl>
#> 1 Sepal.Length 5.01 5.94 6.59
#> 2 Sepal.Width 3.43 2.77 2.97
#> 3 Petal.Length 1.46 4.26 5.55
#> 4 Petal.Width 0.246 1.33 2.03
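The answer mentions wrapping this in a convenience function without showing one. A possible sketch follows; the name summarise_wide and its arguments are hypothetical, and it assumes the summary function accepts na.rm (as mean, min, and max do).

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: summarise every numeric column per group,
# then turn variables into rows and group levels into columns.
summarise_wide <- function(data, group, fn = mean) {
  data %>%
    group_by({{ group }}) %>%
    summarise(across(where(is.numeric), ~ fn(.x, na.rm = TRUE)),
              .groups = "drop") %>%
    pivot_longer(-{{ group }}, names_to = "var") %>%
    pivot_wider(names_from = {{ group }}, values_from = value)
}

res_mean <- summarise_wide(iris, Species)
res_max  <- summarise_wide(iris, Species, fn = max)
res_mean
```

Passing the grouping column with {{ }} keeps the call site unquoted, in the same style as the dplyr verbs it wraps.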
You can use tapply inside lapply:
do.call(rbind, lapply(iris[sapply(iris, is.numeric)],
function(x) tapply(x, iris$Species, mean)))
# setosa versicolor virginica
#Sepal.Length 5.006 5.936 6.588
#Sepal.Width 3.428 2.770 2.974
#Petal.Length 1.462 4.260 5.552
#Petal.Width 0.246 1.326 2.026
This may be a very basic question about using the 'across' function with multiple sets of columns and a different function applied to each. For example, below is an example from the 'iris' data where columns with names starting with 'Sepal' are averaged. What if, along with this, I want the median of the columns starting with 'Petal'? How can I specify the two column sets and their functions in the same summarise?
iris %>%
group_by(Species) %>%
summarise(across(starts_with("Sepal"), mean))
You can use multiple across statements in one summarise/mutate:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(across(starts_with("Sepal"), mean),
across(starts_with("Petal"), median))
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#* <fct> <dbl> <dbl> <dbl> <dbl>
#1 setosa 5.01 3.43 1.5 0.2
#2 versicolor 5.94 2.77 4.35 1.3
#3 virginica 6.59 2.97 5.55 2
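A related variation, in case several functions are needed for the same set of columns: across() also accepts a named list of functions, and .names controls the resulting column names. This is a sketch extending the answer above; the choice of sd as a second function is only illustrative.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    # two summaries per Sepal column, named <column>_<function>
    across(starts_with("Sepal"), list(mean = mean, sd = sd),
           .names = "{.col}_{.fn}"),
    # a single summary per Petal column keeps the original column name
    across(starts_with("Petal"), median)
  )
res
```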
I am aggregating some data and I want to add group sizes N to the output table. Until recently, the code below worked fine. Now, N is equal to the rowcount of my table.
iris %>%
group_by(Species) %>%
group_by(N = n(), .add = TRUE) %>%
summarise_all(list(~mean(., na.rm = TRUE)))
# A tibble: 3 x 6
# Groups: Species [3]
Species N Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 setosa 150 5.01 3.43 1.46 0.246
2 versicolor 150 5.94 2.77 4.26 1.33
3 virginica 150 6.59 2.97 5.55 2.03
This looks like a recently introduced bug. It can be reproduced on dplyr 1.0.3 but not on 1.0.2.
You could, however, avoid the second group_by completely in this case:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(across(everything(), ~ mean(.x, na.rm = TRUE)),
N = n())
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width N
#* <fct> <dbl> <dbl> <dbl> <dbl> <int>
#1 setosa 5.01 3.43 1.46 0.246 50
#2 versicolor 5.94 2.77 4.26 1.33 50
#3 virginica 6.59 2.97 5.55 2.03 50
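If N should sit next to Species, as in the original output, another option is to attach the group size before grouping. This is a sketch using add_count(), which is not part of the answer above; it sidesteps calling n() inside group_by().

```r
library(dplyr)

res <- iris %>%
  add_count(Species, name = "N") %>%   # N = number of rows per species
  group_by(Species, N) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")
res
```

Because N is constant within each species, grouping by both Species and N yields the same three groups as grouping by Species alone.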
Try this:
rm(list = ls())
library(dplyr)
iris %>%
group_by(Species) %>%
group_by(N = n(), .add = TRUE) %>%
summarise_all(list(~mean(., na.rm = TRUE)))
In R, working in the tidyverse:
My data sources change. There's a column which is only present some weeks. When it is, I want to summarize it. Using iris as an example, suppose that Sepal.Width is sometimes missing. Conceptually, I want a function like this
library(tidyverse)
summIris <- function(irisDf){
irisDf %>%
group_by(Species) %>%
summarise_ifPresent(
Sepal.Length = mean(Sepal.Length),
Sepal.Width = mean(Sepal.Width))
}
which would return
R > summIris(iris )
# A tibble: 3 x 3
Species Sepal.Length Sepal.Width
<fct> <dbl> <dbl>
1 setosa 5.01 3.43
2 versicolor 5.94 2.77
3 virginica 6.59 2.97
> summIris(iris %>% select(- Sepal.Width ))
# A tibble: 3 x 2
Species Sepal.Length
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
I could work around by wrapping the logic in if else. But is there something more concise and elegant?
summarize_at allows you to define on which columns you execute the summary, and you can use starts_with, ends_with, matches, or contains to dynamically select columns.
library(dplyr)
iris %>%
group_by(Species) %>%
summarize_at(vars(starts_with("Sepal")), mean)
# # A tibble: 3 x 3
# Species Sepal.Length Sepal.Width
# <fct> <dbl> <dbl>
# 1 setosa 5.01 3.43
# 2 versicolor 5.94 2.77
# 3 virginica 6.59 2.97
iris %>%
select(-Sepal.Length) %>%
group_by(Species) %>%
summarize_at(vars(starts_with("Sepal")), mean)
# # A tibble: 3 x 2
# Species Sepal.Width
# <fct> <dbl>
# 1 setosa 3.43
# 2 versicolor 2.77
# 3 virginica 2.97
Selecting the columns by name with one_of also works, but gives a warning for columns that are not found:
iris %>%
select(-Sepal.Length) %>%
group_by(Species) %>%
summarize_at(vars(one_of(c("Sepal.Width", "Sepal.Length"))), mean)
# Warning: Unknown columns: `Sepal.Length`
# # A tibble: 3 x 2
# Species Sepal.Width
# <fct> <dbl>
# 1 setosa 3.43
# 2 versicolor 2.77
# 3 virginica 2.97
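For completeness, here is a sketch of the same idea with across() and any_of(), assuming dplyr 1.0.0 or later. any_of() silently ignores names that are absent, so no warning is raised. The vector name wanted is only illustrative; summIris is the function name from the question.

```r
library(dplyr)

# Columns to summarise when they happen to be present
wanted <- c("Sepal.Length", "Sepal.Width")

summIris <- function(irisDf) {
  irisDf %>%
    group_by(Species) %>%
    summarise(across(any_of(wanted), mean))
}

full    <- summIris(iris)
partial <- summIris(iris %>% select(-Sepal.Width))
partial
```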