Gathering the iris data set in R

I want to reshape the iris data set in R. It has 5 variables: Species, Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. I need a new column called part that specifies whether the measurement comes from the sepal or the petal, plus length and width columns that hold the measurements. The result should look like this:
I also need to make a second version with a column called measure that indicates the type of measurement (length or width), with the species as columns. It should look like this:
How do I do this using tidyr?

Here is an approach where we make a narrow format tidy data set to start, and then use pivot_wider() to get the result with Length and Width columns.
library(tidyr)
library(dplyr) # provides arrange(), used below
# add an ID variable so we can pivot_wider and match measurement for correct observations
id <- 1:nrow(iris)
data <- cbind(id,iris)
data %>% gather(.,key = "part.measurement",value = "value",-id,-Species) %>%
separate(.,part.measurement,c("part","measurement")) -> narrow_data
head(narrow_data[2:5])
> head(narrow_data[2:5])
Species part measurement value
1 setosa Sepal Length 5.1
2 setosa Sepal Length 4.9
3 setosa Sepal Length 4.7
4 setosa Sepal Length 4.6
5 setosa Sepal Length 5.0
6 setosa Sepal Length 5.4
At this point we can use pivot_wider() to create the Length and Width columns. We'll add an arrange() so the sort order matches the image posted with the question.
narrow_data %>% pivot_wider(.,names_from = measurement,values_from = value) %>%
arrange(Species,part)-> wide_data
head(wide_data[2:5])
...and the output:
> head(wide_data[2:5])
# A tibble: 6 x 4
Species part Length Width
<fct> <chr> <dbl> <dbl>
1 setosa Petal 1.4 0.2
2 setosa Petal 1.4 0.2
3 setosa Petal 1.3 0.2
4 setosa Petal 1.5 0.2
5 setosa Petal 1.4 0.2
6 setosa Petal 1.7 0.4
>
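To see why the id column matters, here is a minimal sketch (my own check, not part of the original answer) that runs the same reshape with and without it. Without id, each Species/part pair matches 50 values per measurement, so pivot_wider() cannot place them in single cells and falls back to list-columns. The object names narrow_no_id, wide_no_id and wide_with_id are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same narrow reshape as above, but WITHOUT an id column
narrow_no_id <- iris %>%
  gather(key = "part.measurement", value = "value", -Species) %>%
  separate(part.measurement, c("part", "measurement"))

# Each Species/part pair now maps to 50 values per measurement, so tidyr
# warns and returns list-columns; the warning is suppressed here on purpose
wide_no_id <- suppressWarnings(
  pivot_wider(narrow_no_id, names_from = measurement, values_from = value)
)

# With an id column, every id/Species/part combination is unique
wide_with_id <- iris %>%
  mutate(id = row_number()) %>%
  gather(key = "part.measurement", value = "value", -id, -Species) %>%
  separate(part.measurement, c("part", "measurement")) %>%
  pivot_wider(names_from = measurement, values_from = value)

dim(wide_no_id)    # 6 rows of list-columns
dim(wide_with_id)  # 300 rows of plain numeric columns
```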
The second output is tricky because it essentially merges the 200 values (50 observations × 2 parts × 2 measurements) for each species of flower into an output tibble of 200 rows, one for each combination of part and measurement for each of the 50 observations of each Species.
# reproduce 2nd output
speciesId <- c(1:200,1:200,1:200) # unique obs within species
narrow_species_data <- cbind(speciesId,narrow_data[order(narrow_data[,1],narrow_data[,3],narrow_data[,4]),c(2:5)])
narrow_species_data %>% pivot_wider(.,names_from= Species,values_from = value) %>%
arrange(part,measurement,speciesId) -> wide_data_species
head(wide_data_species[2:6])
...and the output:
> head(wide_data_species[2:6])
# A tibble: 6 x 5
part measurement setosa versicolor virginica
<chr> <chr> <dbl> <dbl> <dbl>
1 Petal Length 1.4 4.7 6
2 Petal Length 1.4 4.5 5.1
3 Petal Length 1.3 4.9 5.9
4 Petal Length 1.5 4 5.6
5 Petal Length 1.4 4.6 5.8
6 Petal Length 1.7 4.5 6.6
>
A "completely tidyverse" version
Here is a version of both parts of the question that solely uses features from the tidyverse family of packages.
For the first question, we use mutate() and seq_along() to create unique sequential numbers to identify each observation in the original data. We create a narrow form tidy data set with gather(), and then convert it into the desired output with pivot_wider(). To match the order of observations from the image in the original question, we arrange(Species,part).
library(tidyr)
library(dplyr)
# add an ID variable so we can pivot_wider and match measurement for correct observations
iris %>% mutate(id = seq_along(Species)) %>% gather(.,key = "part.measurement",value = "value",-id,-Species) %>%
separate(.,part.measurement,c("part","measurement")) -> narrow_data
narrow_data %>% pivot_wider(.,names_from = measurement,values_from = value) %>%
arrange(Species,part) -> wide_data
head(wide_data[2:5])
...and the output:
> head(wide_data[2:5])
# A tibble: 6 x 4
id part Length Width
<int> <chr> <dbl> <dbl>
1 1 Petal 1.4 0.2
2 2 Petal 1.4 0.2
3 3 Petal 1.3 0.2
4 4 Petal 1.5 0.2
5 5 Petal 1.4 0.2
6 6 Petal 1.7 0.4
>
For the second question, instead of building a vector of sequential IDs for each species and using cbind() to attach it to the rest of the data, we can use dplyr functions to create the sequences within a pipeline.
We use arrange() to sort the data by Species, id, part, and measurement. Then we group_by(Species) so we can use mutate() to create a unique sequential ID with seq_along(). The sort order is important because we want to merge the 1st observation with the 51st observation and the 101st observation.
Then we ungroup() to clear the group_by() and use pivot_wider() with id_cols = c("speciesId","part","measurement") to create the desired output.
narrow_data %>% arrange(Species,id,part,measurement) %>% group_by(Species) %>% mutate(speciesId = seq_along(Species)) %>%
ungroup(.) %>% pivot_wider(.,id_cols=c("speciesId","part","measurement"),names_from= Species,values_from = value) %>%
arrange(part,measurement,speciesId) -> wide_data_species
head(wide_data_species[2:6])
...and the output:
> head(wide_data_species[2:6])
# A tibble: 6 x 5
part measurement setosa versicolor virginica
<chr> <chr> <dbl> <dbl> <dbl>
1 Petal Length 1.4 4.7 6
2 Petal Length 1.4 4.5 5.1
3 Petal Length 1.3 4.9 5.9
4 Petal Length 1.5 4 5.6
5 Petal Length 1.4 4.6 5.8
6 Petal Length 1.7 4.5 6.6
>
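If you want to convince yourself that the sort order really lines the species up, a quick check along these lines may help. This is only a sketch: it rebuilds the narrow data from scratch, and the names with_ids, id_ranges and labels_per_id are hypothetical helpers, not objects from the answer.

```r
library(dplyr)
library(tidyr)

# Rebuild the narrow data and the within-species id as described above
with_ids <- iris %>%
  mutate(id = row_number()) %>%
  gather(key = "part.measurement", value = "value", -id, -Species) %>%
  separate(part.measurement, c("part", "measurement")) %>%
  arrange(Species, id, part, measurement) %>%
  group_by(Species) %>%
  mutate(speciesId = row_number()) %>%
  ungroup()

# Every species should receive the ids 1..200 exactly once
id_ranges <- with_ids %>%
  group_by(Species) %>%
  summarise(n_ids = n_distinct(speciesId), max_id = max(speciesId))

# A given speciesId must point at the same part/measurement in all species,
# otherwise pivot_wider() would put unrelated values on the same row
labels_per_id <- with_ids %>%
  distinct(speciesId, part, measurement) %>%
  count(speciesId)

id_ranges
all(labels_per_id$n == 1)
```

If the arrange() step is dropped or reordered, the second check is the one that would flag the misalignment.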

This can be done solely with tidyr functions:
First step:
(first <- iris %>%
pivot_longer(cols = -Species, names_sep = "\\.", names_to = c("Part", ".value")))
# A tibble: 300 x 4
Species Part Length Width
<fct> <chr> <dbl> <dbl>
1 setosa Sepal 5.1 3.5
2 setosa Petal 1.4 0.2
3 setosa Sepal 4.9 3
4 setosa Petal 1.4 0.2
5 setosa Sepal 4.7 3.2
6 setosa Petal 1.3 0.2
7 setosa Sepal 4.6 3.1
8 setosa Petal 1.5 0.2
9 setosa Sepal 5 3.6
10 setosa Petal 1.4 0.2
# ... with 290 more rows
Second step:
first %>%
pivot_longer(cols = c(Length, Width), names_to = "Measure") %>%
pivot_wider(names_from = Species, values_from = value, values_fn = list(value = list)) %>%
unnest(cols = -c(Part, Measure))
# A tibble: 200 x 5
Part Measure setosa versicolor virginica
<chr> <chr> <dbl> <dbl> <dbl>
1 Sepal Length 5.1 7 6.3
2 Sepal Length 4.9 6.4 5.8
3 Sepal Length 4.7 6.9 7.1
4 Sepal Length 4.6 5.5 6.3
5 Sepal Length 5 6.5 6.5
6 Sepal Length 5.4 5.7 7.6
7 Sepal Length 4.6 6.3 4.9
8 Sepal Length 5 4.9 7.3
9 Sepal Length 4.4 6.6 6.7
10 Sepal Length 4.9 5.2 7.2
# ... with 190 more rows
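For readers who have not met the ".value" sentinel before, here is a small sketch on a made-up two-row data frame (toy is hypothetical, with values copied from two iris rows). The piece of each column name that lands in the ".value" slot becomes an output column name instead of a value in a names column.

```r
library(tidyr)

# Hypothetical two-row stand-in for iris
toy <- data.frame(
  Species      = c("setosa", "versicolor"),
  Sepal.Length = c(5.1, 7.0),
  Sepal.Width  = c(3.5, 3.2),
  Petal.Length = c(1.4, 4.7),
  Petal.Width  = c(0.2, 1.4)
)

# "Sepal.Length" is split at the dot: "Sepal" goes to Part, and
# "Length" (the .value piece) becomes a column of its own
toy_long <- pivot_longer(
  toy,
  cols      = -Species,
  names_to  = c("Part", ".value"),
  names_sep = "\\."
)

toy_long
```

Each input row turns into two output rows (one per Part), with Length and Width side by side.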

This is what I can suggest to achieve the first result:
library(dplyr)
library(tidyr)
df <- iris
# Changing column order
df <- df %>%
select(5, 1:4)
Selecting Species, Petal.Length, Sepal.Length columns and gathering:
length <- df %>%
select(1,2,4) %>%
gather("Part", "Length", -1)
length$Part <- gsub(pattern = "\\.Length$", replacement = "", x = length$Part)
head(length)
Species Part Length
1 setosa Sepal 5.1
2 setosa Sepal 4.9
3 setosa Sepal 4.7
4 setosa Sepal 4.6
5 setosa Sepal 5.0
6 setosa Sepal 5.4
Selecting Species, Petal.Width, Sepal.Width columns and gathering:
width <- df %>%
select(1,3,5) %>%
gather("Part", "Width", -1)
width$Part <- gsub(pattern = "\\.Width$", replacement = "", x = width$Part)
head(width)
Species Part Width
1 setosa Sepal 3.5
2 setosa Sepal 3.0
3 setosa Sepal 3.2
4 setosa Sepal 3.1
5 setosa Sepal 3.6
6 setosa Sepal 3.9
Combining the 2 datasets:
merged_df <- length %>%
mutate(Width = width$Width)
head(merged_df)
Species Part Length Width
1 setosa Sepal 5.1 3.5
2 setosa Sepal 4.9 3.0
3 setosa Sepal 4.7 3.2
4 setosa Sepal 4.6 3.1
5 setosa Sepal 5.0 3.6
6 setosa Sepal 5.4 3.9
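Note that mutate(Width = width$Width) only works because both tables happen to be in the same row order. A more defensive variant, sketched below under my own assumptions, carries a row id through and joins on it. I have swapped gather() for pivot_longer() here, and the names keyed, len, wid and merged_by_key are illustrative only.

```r
library(dplyr)
library(tidyr)

# Tag every flower with an id so the two halves can be matched by key
keyed <- iris %>% mutate(id = row_number())

len <- keyed %>%
  select(id, Species, Sepal.Length, Petal.Length) %>%
  pivot_longer(c(Sepal.Length, Petal.Length),
               names_to = "Part", values_to = "Length") %>%
  mutate(Part = sub("\\.Length$", "", Part))

wid <- keyed %>%
  select(id, Species, Sepal.Width, Petal.Width) %>%
  pivot_longer(c(Sepal.Width, Petal.Width),
               names_to = "Part", values_to = "Width") %>%
  mutate(Part = sub("\\.Width$", "", Part))

# Join on the key columns instead of relying on row position
merged_by_key <- inner_join(len, wid, by = c("id", "Species", "Part"))

head(merged_by_key)
```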

library(tidyverse)
# Pivoting iris to long data
long_iris <- iris |>
mutate(id = row_number()) |>
pivot_longer(
cols = !c(id, Species),
names_to = c("Part", "Measure"),
names_sep = "\\." # Iris variable separator regex
)
long_iris
#> # A tibble: 600 x 5
#> Species id Part Measure value
#> <fct> <int> <chr> <chr> <dbl>
#> 1 setosa 1 Sepal Length 5.1
#> 2 setosa 1 Sepal Width 3.5
#> 3 setosa 1 Petal Length 1.4
#> 4 setosa 1 Petal Width 0.2
#> 5 setosa 2 Sepal Length 4.9
#> 6 setosa 2 Sepal Width 3
#> 7 setosa 2 Petal Length 1.4
#> 8 setosa 2 Petal Width 0.2
#> 9 setosa 3 Sepal Length 4.7
#> 10 setosa 3 Sepal Width 3.2
#> # ... with 590 more rows
# Using the long data, we can pivot back to a wide format
iris_length_width <- long_iris |>
pivot_wider(
id_cols = c(id, Species, Part),
names_from = Measure,
values_from = value
)
# This achieves the same thing
iris |>
mutate(id = row_number()) |>
pivot_longer(
cols = !c(id, Species),
names_to = c("Part", ".value"),
names_sep = "\\." # Iris variable separator regex
)
#> # A tibble: 300 x 5
#> Species id Part Length Width
#> <fct> <int> <chr> <dbl> <dbl>
#> 1 setosa 1 Sepal 5.1 3.5
#> 2 setosa 1 Petal 1.4 0.2
#> 3 setosa 2 Sepal 4.9 3
#> 4 setosa 2 Petal 1.4 0.2
#> 5 setosa 3 Sepal 4.7 3.2
#> 6 setosa 3 Petal 1.3 0.2
#> 7 setosa 4 Sepal 4.6 3.1
#> 8 setosa 4 Petal 1.5 0.2
#> 9 setosa 5 Sepal 5 3.6
#> 10 setosa 5 Petal 1.4 0.2
#> # ... with 290 more rows
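As a sanity check on the "same thing" claim, the two routes can be compared directly. This is a rough sketch; normalise() is a hypothetical helper that fixes the column and row order before comparing, since I am not relying on either function's default ordering.

```r
library(dplyr)
library(tidyr)

long_iris <- iris %>%
  mutate(id = row_number()) %>%
  pivot_longer(cols = !c(id, Species),
               names_to = c("Part", "Measure"), names_sep = "\\.")

# Route 1: long first, then back to wide
two_step <- long_iris %>%
  pivot_wider(id_cols = c(id, Species, Part),
              names_from = Measure, values_from = value)

# Route 2: a single pivot_longer() with the ".value" sentinel
one_step <- iris %>%
  mutate(id = row_number()) %>%
  pivot_longer(cols = !c(id, Species),
               names_to = c("Part", ".value"), names_sep = "\\.")

# Hypothetical helper: same column order, same row order, plain data frame
normalise <- function(d) {
  d %>%
    select(id, Species, Part, Length, Width) %>%
    arrange(id, Part) %>%
    as.data.frame()
}

all.equal(normalise(two_step), normalise(one_step))
```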
2nd part
long_iris
#> # A tibble: 600 x 5
#> Species id Part Measure value
#> <fct> <int> <chr> <chr> <dbl>
#> 1 setosa 1 Sepal Length 5.1
#> 2 setosa 1 Sepal Width 3.5
#> 3 setosa 1 Petal Length 1.4
#> 4 setosa 1 Petal Width 0.2
#> 5 setosa 2 Sepal Length 4.9
#> 6 setosa 2 Sepal Width 3
#> 7 setosa 2 Petal Length 1.4
#> 8 setosa 2 Petal Width 0.2
#> 9 setosa 3 Sepal Length 4.7
#> 10 setosa 3 Sepal Width 3.2
#> # ... with 590 more rows
# creating a row id by group to create the second output
long_iris |>
group_by(Species) |>
mutate(id = row_number()) |>
ungroup() |>
pivot_wider(
id_cols = c(id, Part, Measure),
names_from = Species,
values_from = value
) |>
arrange(Measure, Part) |>
select(-id)
#> # A tibble: 200 x 5
#> Part Measure setosa versicolor virginica
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Petal Length 1.4 4.7 6
#> 2 Petal Length 1.4 4.5 5.1
#> 3 Petal Length 1.3 4.9 5.9
#> 4 Petal Length 1.5 4 5.6
#> 5 Petal Length 1.4 4.6 5.8
#> 6 Petal Length 1.7 4.5 6.6
#> 7 Petal Length 1.4 4.7 4.5
#> 8 Petal Length 1.5 3.3 6.3
#> 9 Petal Length 1.4 4.6 5.8
#> 10 Petal Length 1.5 3.9 6.1
#> # ... with 190 more rows
Created on 2022-06-15 by the reprex package (v2.0.1)

Related

Create a column in the original dataset to indicate whether the row was drawn in a random stratified sample

I would like to draw a stratified random sample (n = 375) from a dataset. Based on the stratified random sample, I would like to add a column to the original dataset indicating whether the row is in the stratified random sample (1) or not (0).
iris <- iris
# Get a random stratified sample
library(tidyverse)
stratified <- iris %>%
group_by(Species) %>%
sample_n(size=1)
# The final result I would like to get:
iris$sample3 <- 0
iris[21,6] <- 1
iris[65,6] <- 1
iris[106,6] <- 1
After doing that, I would like to repeat the procedure by drawing a second stratified random sample (n = 125) from my first stratified random sample (n = 375) and repeat the creation of a column.
You can add a column to your data frame that has the required number of 1s per group (and 0 otherwise).
set.seed(1)
samples <- 1
sample1 <- iris %>%
group_by(Species) %>%
mutate(sampled = as.numeric(row_number() %in% sample(n(), samples)))
sample1
#> # A tibble: 150 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0
#> 2 4.9 3 1.4 0.2 setosa 0
#> 3 4.7 3.2 1.3 0.2 setosa 0
#> 4 4.6 3.1 1.5 0.2 setosa 1
#> 5 5 3.6 1.4 0.2 setosa 0
#> 6 5.4 3.9 1.7 0.4 setosa 0
#> 7 4.6 3.4 1.4 0.3 setosa 0
#> 8 5 3.4 1.5 0.2 setosa 0
#> 9 4.4 2.9 1.4 0.2 setosa 0
#> 10 4.9 3.1 1.5 0.1 setosa 0
#> # ... with 140 more rows
To get the sampled values, simply filter to find the 1s:
sample1 %>% filter(sampled == 1)
#> # A tibble: 3 x 6
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species sampled
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 4.6 3.1 1.5 0.2 setosa 1
#> 2 5.6 3 4.1 1.3 versicolor 1
#> 3 6.3 3.3 6 2.5 virginica 1
Created on 2022-05-16 by the reprex package (v2.0.1)
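The second stage from the question (a sub-sample drawn from the first sample) is not covered above. One possible sketch, assuming the same indicator-column idea, is to sample only among the rows already flagged in the first stage. The sizes n_first and n_second and the column names sample1 and sample2 are placeholders I chose for the iris example, not the n = 375 and n = 125 from the question.

```r
library(dplyr)

set.seed(1)
n_first  <- 10  # placeholder: rows per group in the first sample
n_second <- 3   # placeholder: rows per group in the nested second sample

two_stage <- iris %>%
  group_by(Species) %>%
  mutate(
    # stage 1: flag n_first random rows in each group
    sample1 = as.integer(row_number() %in% sample(n(), n_first)),
    # stage 2: draw n_second rows from the positions flagged in stage 1
    sample2 = as.integer(
      row_number() %in% sample(which(sample1 == 1), n_second)
    )
  ) %>%
  ungroup()

# Rows flagged per species in each stage
two_stage %>%
  group_by(Species) %>%
  summarise(first = sum(sample1), second = sum(sample2))
```

One caveat: sample() treats a single number as "1 to that number", so this sketch assumes n_first is at least 2.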

How to use dplyr::if_else to mutate across a tibble dataframe?

How can I combine mutate and if_else to convert every value in a data frame to TRUE or FALSE?
For example, turn a table into TRUE (value >= 2) and FALSE (value < 2):
> iris %>% as_tibble() %>% select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
# A tibble: 150 × 4
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2
2 4.9 3 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
7 4.6 3.4 1.4 0.3
8 5 3.4 1.5 0.2
9 4.4 2.9 1.4 0.2
10 4.9 3.1 1.5 0.1
# … with 140 more rows
into
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 T T F F
2 T T F F
3 T T F F
4 T T F F
5 T T F F
6 T T F F
7 T T F F
Thanks a lot!
iris %>%
mutate(across(where(is.numeric), ~ . >= 2))
You don't need if_else when the result you want is TRUE or FALSE. Generally, ifelse(test, TRUE, FALSE) is a long way of writing test.
Or in base R
iris[1:4] >= 2
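To illustrate the point about if_else() being redundant here, the following sketch compares both spellings. The names flags and flags_if_else are just for this example.

```r
library(dplyr)

# The comparison itself already returns a logical vector
flags <- iris %>%
  as_tibble() %>%
  mutate(across(where(is.numeric), ~ .x >= 2))

# Wrapping it in if_else() gives the same result with more typing
flags_if_else <- iris %>%
  as_tibble() %>%
  mutate(across(where(is.numeric), ~ if_else(.x >= 2, TRUE, FALSE)))

all.equal(flags, flags_if_else)
head(flags, 3)
```

if_else() becomes useful once the two outcomes are something other than TRUE and FALSE, for example the labels "high" and "low".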

Select grouped random rows, change value in one column

For my study design I need to select a total of 12 rows from each group (10 groups) and change the value of one column from 0 to 1.
How would I go about this? I already tried sample_n, but it only returns the randomly selected rows, not the entire dataset.
test <- test %>% group_by(group) %>% mutate(
change_value = sample_n(12)
) %>% ungroup()
Sorry I am stuck after this.
Thank you in advance
Your requirement is not very clear.
Case 1: you want to select 12 random rows from each group, change the value of one column, and return the entire dataset.
library(tidyverse)
set.seed(2021)
iris %>% group_by(Species) %>%
mutate(Sepal.Width = ifelse(sample(1:n(), n()) <= 12, 1, Sepal.Width)) %>%
ungroup()
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 1 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 1 1.4 0.2 setosa
6 5.4 1 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 1 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
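If the column to change is a 0/1 indicator, as in the question, a variant along these lines might be closer to what is needed. It is a sketch on iris, with change_value standing in for the asker's column and Species standing in for group. It also checks that exactly 12 rows per group were flagged.

```r
library(dplyr)

set.seed(2021)

flagged <- iris %>%
  group_by(Species) %>%
  # 1 for 12 randomly chosen rows in each group, 0 for all others
  mutate(change_value = as.integer(row_number() %in% sample(n(), 12))) %>%
  ungroup()

# Number of flagged rows per group
flagged_counts <- flagged %>%
  group_by(Species) %>%
  summarise(n_changed = sum(change_value))

flagged_counts
```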

Append dataframe to end of each dataframe in a list of dataframes in r

I would like to add one row to the end of each dataframe in a list of dataframes. In this example, I would like to add the column names as a new row to the bottom of each dataframe in the list of dataframes I created by group_split.
library(dplyr)
col_names1 <- as.data.frame(t(as.data.frame(colnames(iris))))
colnames(col_names1) <- unlist(col_names1[1, ])
rownames(col_names1) <-""
iris %>%
group_split(Species) %>%
bind_rows(col_names1) #errors out: Error: Column `Sepal.Length` can't be converted from numeric to factor
I would like to end up with a list of dataframes, each with their column names as a new row at the bottom of each dataframe in the list.
One issue is the type difference. We can convert the columns to the same type and then call bind_rows(). Also, because group_split() returns a list of data frames, we need to loop over the list with map() and apply bind_rows() to each element.
library(dplyr)
library(purrr)
iris %>%
group_split(Species) %>%
map(~ bind_rows(.x %>%
mutate_all(factor), col_names1))
#[[1]]
# A tibble: 51 x 5
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# * <fct> <fct> <fct> <fct> <fct>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 8 5 3.4 1.5 0.2 setosa
# 9 4.4 2.9 1.4 0.2 setosa
#10 4.9 3.1 1.5 0.1 setosa
# … with 41 more rows
#[[2]]
# A tibble: 51 x 5
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# * <fct> <fct> <fct> <fct> <fct>
#...
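Since the appended row is text anyway, converting everything to character instead of factor may be a more natural fit. The sketch below is an alternative to the code above, not a correction of it. header_row and with_header are names I made up, and across() assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(purrr)

# One-row data frame whose values are the column names themselves
header_row <- as.data.frame(
  as.list(setNames(names(iris), names(iris))),
  stringsAsFactors = FALSE
)

with_header <- iris %>%
  group_split(Species) %>%
  map(~ .x %>%
        mutate(across(everything(), as.character)) %>%  # common type
        bind_rows(header_row))                          # append names row

# Last two rows of the first data frame in the list
tail(with_header[[1]], 2)
```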

dplyr: why do individual count summaries and index summaries differ

I'm creating a new column, with a count of grouped summaries within a function. Why does:
iris %>%
group_by(Species) %>%
mutate(Count = sum(Sepal.Length + Sepal.Width + Petal.Length + Petal.Width))
Not produce the same result as
iris %>% mutate(count = sum(.[1:ncol(.)]))
Or
iris %>%
group_by(Species) %>%
mutate(Count = map_if(is.numeric, sum(rowSums(.))))
And how can I use column indexes to create a count sum for insertion into a function with variable column names? (The original reason for indexing)
An approach would be to nest after group_by, loop through the nested 'data' with map, select the numeric columns (select_if), mutate to create the 'Count' by getting the sum of rowSums, and unnest
library(tidyverse)
iris %>%
group_by(Species) %>%
nest() %>%
mutate(data = map(data, ~ .x %>%
select_if(is.numeric) %>%
mutate(Count = sum(rowSums(.))))) %>%
#or use reduce with sum
# mutate(Count = reduce(., `+`) %>% sum))) %>%
unnest(cols = data)
# A tibble: 150 x 6
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width Count
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 setosa 5.1 3.5 1.4 0.2 507.
# 2 setosa 4.9 3 1.4 0.2 507.
# 3 setosa 4.7 3.2 1.3 0.2 507.
# 4 setosa 4.6 3.1 1.5 0.2 507.
# 5 setosa 5 3.6 1.4 0.2 507.
# 6 setosa 5.4 3.9 1.7 0.4 507.
# 7 setosa 4.6 3.4 1.4 0.3 507.
# 8 setosa 5 3.4 1.5 0.2 507.
# 9 setosa 4.4 2.9 1.4 0.2 507.
#10 setosa 4.9 3.1 1.5 0.1 507.
# ... with 140 more rows
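On the "why" part: as far as I can tell, the `.` in a magrittr pipeline always refers to the whole data frame on the left-hand side, not to the current group, and in the ungrouped attempt it also includes the Species factor, which sum() cannot handle. With dplyr 1.0.0 or later, across() evaluates inside each group, so a shorter sketch is possible. Here num_cols is a hypothetical character vector standing in for the variable column names mentioned in the question.

```r
library(dplyr)

# Hypothetical: the columns to total, e.g. passed in as a function argument
num_cols <- names(iris)[1:4]

counted <- iris %>%
  group_by(Species) %>%
  # across() returns only the current group's rows for the chosen columns
  mutate(Count = sum(rowSums(across(all_of(num_cols))))) %>%
  ungroup()

# One distinct total per species
distinct(counted, Species, Count)
```

This should reproduce the 507. shown above for setosa without the nest/unnest round trip.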

Resources