R dplyr with seq() - r

I need to mutate a column in a dataframe, with the seq of another column.
For example with iris, I would like to add a new column for each Species, with
seq(min(Sepal.Length),max(Sepal.Length),length=100)
I tried (with no success):
iris %>%
group_by(Species) %>%
mutate(seqq = seq(min(Sepal.Length),max(Sepal.Length), 100))
Any ideas?
thank you!

mutate needs to return the same number of rows as the original data or the ones in the group_by. We may use summarise
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(seq = seq(min(Sepal.Length),max(Sepal.Length),
length = 100), .groups = 'drop')
-output
# A tibble: 300 x 2
# Groups: Species [3]
Species seq
<fct> <dbl>
1 setosa 4.3
2 setosa 4.32
3 setosa 4.33
4 setosa 4.35
5 setosa 4.36
6 setosa 4.38
7 setosa 4.39
8 setosa 4.41
9 setosa 4.42
10 setosa 4.44
# … with 290 more rows

Related

Automate z-score calculation by group

I have the following data frame:
df<- splitstackshape::stratified(iris, group="Species", size=1)
I want to make a z-score for each species including all of the variables. I can do this manually by finding the SD and mean for each row and using the appropriate formula, but I need to do this several times over and would like to find a more efficient way.
I tried using scale(), but can't figure out how to get it to do the row-wise calculation that includes several variables and a grouping variable.
Using dplyr::group_by returns a "'x' must be numeric variable" error.
Are you sure the question is taking a z-score to each group? It should be for each value.
Lets say the functions to take z-score could be:
scale(x, center = TRUE, scale = TRUE)
Or
function_zscore = function(x){x <- x[na.rm = TRUE]; return(((x) - mean(x)) / sd(x))}
Both functions suggest that if the argument x is a vector, the results will return to a vector too.
df<- splitstackshape::stratified(iris, group="Species", size=1)
df <- tidyr::pivot_longer(df, cols = c(1:4), names_to = "var.name", values_to = "value")
df %>%
group_by(Species) %>%
mutate(zscore = scale(value, center = TRUE, scale = TRUE)[,1])
## A tibble: 12 x 4
## Groups: Species [3]
# Species var.name value zscore
# <fct> <chr> <dbl> <dbl>
# 1 setosa Sepal.Length 4.9 1.22
# 2 setosa Sepal.Width 3.1 0.332
# 3 setosa Petal.Length 1.5 -0.455
# 4 setosa Petal.Width 0.2 -1.09
# 5 versicolor Sepal.Length 5.9 1.10
# 6 versicolor Sepal.Width 3.2 -0.403
# 7 versicolor Petal.Length 4.8 0.486
# 8 versicolor Petal.Width 1.8 -1.18
# 9 virginica Sepal.Length 6.5 1.14
#10 virginica Sepal.Width 3 -0.574
#11 virginica Petal.Length 5.2 0.501
#12 virginica Petal.Width 2 -1.06
If we still hope to get a score for each group to describe how a sample deviates around the mean, a possible solution could be getting the coefficient of variation?
df %>%
group_by(Species) %>%
summarise(coef.var = 100*sd(value)/mean(value))
## A tibble: 3 x 2
# Species coef.var
# <fct> <dbl>
#1 setosa 83.8
#2 versicolor 45.8
#3 virginica 49.0

R - how to split into terciles during group_by

Let's say we want to calculate the means of sepal length based on tercile groups of sepal width.
We can use the split_quantile function from the fabricatr package and do the following:
iris %>%
group_by(split_quantile(Sepal.Width, 3)) %>%
summarise(Sepal.Length = mean(Sepal.Length))
So far so good. Now, let's say we want to group_by(Species, split_quantile(Sepal.Width, 3)) instead of just group_by(split_quantile(Sepal.Width, 3)).
However, what if we want the terciles to be calculated inside of the each species type and not generally?
Basically, what I'm looking for could be achieved by splitting iris into several dataframes based on Species, using split_quantile on those dataframes to calculate terciles and then joining the dataframes back together. However, I'm looking for a way to do this without splitting the dataframe.
You kinda have written the answer in your text, but you can create a new variable for tercile after grouping by species, then regroup with both Species and Tercile.
library(tidyverse)
library(fabricatr)
iris %>%
group_by(Species) %>%
mutate(Tercile = split_quantile(Sepal.Width, 3)) %>%
group_by(Species, Tercile) %>%
summarise(Sepal.Length = mean(Sepal.Length))
#> # A tibble: 9 x 3
#> # Groups: Species [3]
#> Species Tercile Sepal.Length
#> <fct> <fct> <dbl>
#> 1 setosa 1 4.69
#> 2 setosa 2 5.08
#> 3 setosa 3 5.27
#> 4 versicolor 1 5.61
#> 5 versicolor 2 6.12
#> 6 versicolor 3 6.22
#> 7 virginica 1 6.29
#> 8 virginica 2 6.73
#> 9 virginica 3 6.81
Created on 2020-05-27 by the reprex package (v0.3.0)

dplyr summarise (collapse) dataset by different functions for multiple columns

I'm trying to dplyr::summarise a dataset (collapse) by different summarise_at/summarise_if functions so that I have the same named variables in my output dataset. Example:
library(tidyverse)
data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris$letter <- as.factor(rep(c("A","B","C"),each=50)) ## factor column
head(iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species year color letter
1 5.1 3.5 1.4 0.2 setosa 2000 red A
2 4.9 3.0 1.4 0.2 setosa 2000 red A
3 4.7 3.2 1.3 0.2 setosa 2000 red A
The resulting dataset should look like this:
full
Species year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 setosa 2000 87 6.2 5.8 1.9 A red
2 setosa 3000 84.4 6.1 5.5 1.9 A red
3 versicolor 2000 69.4 33.6 7 4.9 B green
4 versicolor 3000 69.1 32.7 6.8 5.1 B green
5 virginica 2000 73.2 51.1 7.7 6.9 C blue
6 virginica 3000 75.5 50.2 7.9 6.4 C blue
I can achieve this by doing the following which is a bit repetitive:
sums <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Width")), list(sum))
max <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Length")), list(max))
last <- iris %>%
group_by(Species, year) %>%
summarise_if(is.factor, list(last))
first <- iris %>%
group_by(Species, year) %>%
summarise_if(is.character, list(first))
full <- full_join(sums, max) %>% full_join(last) %>% full_join(first)
I have found similar approaches below but can't figure out the approach I've tried here. I would prefer not to make my own function as I think something like this is cleaner by passing everything through a pipe and joining:
test <- iris %>%
#group_by(.vars = vars(Species, year)) %>% #why doesnt this work?
group_by_at(.vars = vars(Species, year)) %>% #doesnt work
{left_join(
summarise_at(., vars(matches("Width")), list(sum)),
summarise_at(., vars(matches("Length")), list(max)),
summarise_if(., is.factor, list(last)),
summarise_if(., is.character, list(first))
)
} #doesnt work
This doesnt work, any suggestions or other approaches?
Helpful:
How can I use summarise_at to apply different functions to different columns?
Summarize different Columns with different Functions
Using dplyr summarize with different operations for multiple columns
By default, the dplyr::left_join() function only accepts two data frames. If you want to use this function with more than two data frames, you can iterate it with the Reduce function (base R function):
iris %>%
group_by(Species, year) %>%
{
Reduce(
function(x, y) left_join(x, y),
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
))
}
# Species year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 setosa 2000 87 6.2 5.8 1.9 A red
# 2 setosa 3000 84.4 6.1 5.5 1.9 A red
# 3 versicolor 2000 69.4 33.6 7 4.9 B green
# 4 versicolor 3000 69.1 32.7 6.8 5.1 B green
# 5 virginica 2000 73.2 51.1 7.7 6.9 C blue
# 6 virginica 3000 75.5 50.2 7.9 6.4 C blue
Furthermore, notice I had to call functions from its package by using :: in order to avoid name overlapping with previously created data frames.
Robbing #Ulises idea and using purrr::reduce instead of Reduce is an alternative:
iris %>%
group_by(Species, year) %>%
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
) %>%
.[c(2:5)] %>%
reduce(left_join)
OR solution with curly brackets to suppress the first argument:
iris %>%
group_by(Species, year) %>%
{
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
)
} %>%
reduce(left_join)

Is there a way to use a lookup value from a table in a mutate column?

library(tidyverse)
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
mutate(new_col = rank == 4, Sepal.Width)
table <- df %>%
filter(rank == 4) %>%
select(Species, new_col = Sepal.Width)
correct_df <- left_join(df, table, by = "Species")
df
#> # A tibble: 150 x 8
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Dim
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0.280
#> 2 4.9 3 1.4 0.2 setosa 0.280
#> 3 4.7 3.2 1.3 0.2 setosa 0.26
#> 4 4.6 3.1 1.5 0.2 setosa 0.3
#> 5 5 3.6 1.4 0.2 setosa 0.280
#> 6 5.4 3.9 1.7 0.4 setosa 0.68
#> 7 4.6 3.4 1.4 0.3 setosa 0.42
#> 8 5 3.4 1.5 0.2 setosa 0.3
#> 9 4.4 2.9 1.4 0.2 setosa 0.280
#> 10 4.9 3.1 1.5 0.1 setosa 0.15
#> # ... with 140 more rows, and 2 more variables: rank <dbl>, new_col <lgl>
I'm basically looking for new_col to show the value that corresponds with rank = 4 from the Sepal.Width column. In this case, those values would be 3.9, 3.3, and 3.8. I'm envisioning this similar to a VLookup, or Index/Match in Excel.
When ever I think "now I need to use VLOOKUP like I did in the past in Excel" I find the left_join() function helpful. It's also part of the dplyr package. Instead of "looking up" values in one table in another table, it's easier for R to just make one bigger table where one table remains unchanged (here the "left" one or the first term you put in the function) and the other is added using a column or columns they have in common as an index.
In your specific example, I can't entirely understand what you want new_col to have in it. If you want to do Excel-style VLOOKUP in R, then left_join() is the best starting point.
The question is not clear since it does not mention the purpose of a Vlookup or Index/Match like operation from Excel.
Also, you don't mention what value should "new_col" have if rank is not equal to 4.
Assuming the value is NA, the below solution with a simple ifelse would work:
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
ungroup() %>%
mutate(new_col = ifelse(rank == 4, Sepal.Width,NA))
df

Add row in each group using dplyr and add_row()

If I add a new row to the iris dataset with:
iris <- as_tibble(iris)
> iris %>%
add_row(.before=0)
# A tibble: 151 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 NA NA NA NA <NA> <--- Good!
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3.0 1.4 0.2 setosa
It works. So, why can't I add a new row on top of each "subset" with:
iris %>%
group_by(Species) %>%
add_row(.before=0)
Error: is.data.frame(df) is not TRUE
If you want to use a grouped operation, you need do like JasonWang described in his comment, as other functions like mutate or summarise expect a result with the same number of rows as the grouped data frame (in your case, 50) or with one row (e.g. when summarising).
As you probably know, in general do can be slow and should be a last resort if you cannot achieve your result in another way. Your task is quite simple because it only involves adding extra rows in your data frame, which can be done by simple indexing, e.g. look at the output of iris[NA, ].
What you want is essentially to create a vector
indices <- c(NA, 1:50, NA, 51:100, NA, 101:150)
(since the first group is in rows 1 to 50, the second one in 51 to 100 and the third one in 101 to 150).
The result is then iris[indices, ].
A more general way of building this vector uses group_indices.
indices <- seq(nrow(iris)) %>%
split(group_indices(iris, Species)) %>%
map(~c(NA, .x)) %>%
unlist
(map comes from purrr which I assume you have loaded as you have tagged this with tidyverse).
A more recent version would be using group_modify() instead of do().
iris %>%
as_tibble() %>%
group_by(Species) %>%
group_modify(~ add_row(.x,.before=0))
#> # A tibble: 153 x 5
#> # Groups: Species [3]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa NA NA NA NA
#> 2 setosa 5.1 3.5 1.4 0.2
#> 3 setosa 4.9 3 1.4 0.2
With a slight variation, this could also be done:
library(purrr)
library(tibble)
iris %>%
group_split(Species) %>%
map_dfr(~ .x %>%
add_row(.before = 1))
# A tibble: 153 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 NA NA NA NA NA
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3 1.4 0.2 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
6 5 3.6 1.4 0.2 setosa
7 5.4 3.9 1.7 0.4 setosa
8 4.6 3.4 1.4 0.3 setosa
9 5 3.4 1.5 0.2 setosa
10 4.4 2.9 1.4 0.2 setosa
# ... with 143 more rows
This also can be used for grouped data frame, however, it's a bit verbose:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = c(NA, Sepal.Length),
Sepal.Width = c(NA, Sepal.Width),
Petal.Length = c(NA, Petal.Length),
Petal.Width = c(NA, Petal.Width),
Species = c(NA, Species))

Resources