library(tidyverse)
iris %>% as_tibble() %>% select(everything())
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 140 more rows
Say I want to select everything in the iris data frame except Species. How do I list this one exception while utilizing tidyselect::everything()?
My actual pipe is below:
... %>%
group_by(`ID`) %>%
fill(everything(), .direction = "updown") %>%
... %>%
When I run it, I get the following error:
Error: Column ID can't be modified because it's a grouping variable
You would do
iris %>% as_tibble() %>% select(-Species)
but assuming you have good reason not to want that, here's a way using everything()
iris %>% as_tibble() %>% select(setdiff(everything(), one_of("Species")))
#> # A tibble: 150 x 4
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
#> 7 4.6 3.4 1.4 0.3
#> 8 5 3.4 1.5 0.2
#> 9 4.4 2.9 1.4 0.2
#> 10 4.9 3.1 1.5 0.1
#> # ... with 140 more rows
(or just iris %>% as_tibble() %>% select(setdiff(everything(), 5)) if it's acceptable)
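For comparison, newer tidyselect releases accept negation directly inside select(). The following is a minimal sketch, assuming dplyr 1.0.0 or later; the object names are only illustrative and do not come from the question.

```r
library(dplyr)

iris_tbl <- as_tibble(iris)

# Classic form: drop one column with a minus sign
no_species_minus <- iris_tbl %>% select(-Species)

# Boolean negation (assumes dplyr >= 1.0.0 / tidyselect >= 1.0.0)
no_species_bang <- iris_tbl %>% select(!Species)

# Keep everything() explicit and subtract the exception afterwards
no_species_all <- iris_tbl %>% select(everything(), -Species)

names(no_species_all)
```

All three calls should return the same four measurement columns, so the choice is mostly a matter of readability.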
I have a tibble in R and I want to add a new column whose name comes from the value of a variable. What is the easiest way to achieve this?
# let us generate a whole set of new features
iris_tbl = as_tibble(iris)
iris_tbl <- iris_tbl %>% mutate(hello=1)
> print(iris_tbl)
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species hello
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 1
8 5 3.4 1.5 0.2 setosa 1
9 4.4 2.9 1.4 0.2 setosa 1
10 4.9 3.1 1.5 0.1 setosa 1
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
# This is the example that does not work as desired
var_name='hello'
iris_tbl = as_tibble(iris)
iris_tbl <- iris_tbl %>% mutate(var_name=1)
> print(iris_tbl)
# A tibble: 150 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species var_name
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3 1.4 0.2 setosa 1
3 4.7 3.2 1.3 0.2 setosa 1
4 4.6 3.1 1.5 0.2 setosa 1
5 5 3.6 1.4 0.2 setosa 1
6 5.4 3.9 1.7 0.4 setosa 1
7 4.6 3.4 1.4 0.3 setosa 1
8 5 3.4 1.5 0.2 setosa 1
9 4.4 2.9 1.4 0.2 setosa 1
10 4.9 3.1 1.5 0.1 setosa 1
# … with 140 more rows
# ℹ Use `print(n = ...)` to see more rows
In the first example, mutate creates a column called 'hello'. In the second example, mutate names the column 'var_name' instead of 'hello', which is the desired outcome.
Any suggestions on how to make this as easy as possible?
Run ?dplyr_data_masking to open the relevant help page.
If you read through that, you can see there are at least 2 ways you can get your desired result.
iris_tbl <- iris_tbl %>% mutate("{var_name}" := 1)
Or
iris_tbl <- iris_tbl %>% mutate({{var_name}} := 1)
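As a self-contained check of the idea, here is a small sketch that builds the column name from a variable in two ways. It assumes a reasonably recent rlang (0.4.3 or later for the glue-style string); `with_glue` and `with_bang` are illustrative names only.

```r
library(dplyr)

var_name <- "hello"
iris_tbl <- as_tibble(iris)

# Glue-style name on the left of := (assumes rlang >= 0.4.3)
with_glue <- iris_tbl %>% mutate("{var_name}" := 1)

# Older unquoting form: !! injects the string stored in var_name
with_bang <- iris_tbl %>% mutate(!!var_name := 1)

names(with_glue)
```

In both cases the walrus operator `:=` is needed, because R does not allow an expression on the left-hand side of `=` inside a function call.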
I use dplyr quite a lot for data wrangling, but I have never understood how filter behaves when it is called as filter(df, variable == c(value1, value2)).
Let's use the iris data set as an example.
library(dplyr)
data(iris)
# I want to filter by Species 'setosa' and 'versicolor'
# Solution 1
filter1 <- filter(iris, Species == 'setosa' | Species == 'versicolor')
nrow(filter1)
[1] 100 # expected result
# Solution 2
filter2 <- filter(iris, Species %in% c('setosa', 'versicolor'))
nrow(filter2)
[1] 100 # expected result
filter1 == filter2 # both solutions return the exact same result
#Solution 3
filter3 <- filter(iris, Species == c('setosa', 'versicolor'))
nrow(filter3)
[1] 50 # unexpected result
unique(filter3$Species)
[1] setosa versicolor
Levels: setosa versicolor virginica
Although Solution 3 is filtering for the intended species, as shown by unique(filter3$Species), it only returns half of the occurrences (50 compared to 100 in Solution 1 and Solution 2). I would appreciate some guidance on what is actually going on in Solution 3.
filter(iris, Species == c("versicolor", "setosa")) does not make sense in an intuitive way, because one Species is not a 2-tuple:
> "setosa" == c("setosa", "versicolor")
[1] TRUE FALSE
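Interestingly, filter(iris, Species == c("setosa", "versicolor")) does not raise an error. R recycles the length-2 vector along the 150-row column, so each row is compared with only one of the two values, in alternating order; that is why only half of the matching rows (50 of 100) are kept. The printed rows start with whichever species comes first in the data frame, so sorting in descending order shows versicolor first: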
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>%
as_tibble()
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
iris %>%
filter(Species == c('versicolor', 'setosa')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 4.9 3 1.4 0.2 setosa
#> 2 4.6 3.1 1.5 0.2 setosa
#> 3 5.4 3.9 1.7 0.4 setosa
#> 4 5 3.4 1.5 0.2 setosa
#> 5 4.9 3.1 1.5 0.1 setosa
#> 6 4.8 3.4 1.6 0.2 setosa
#> 7 4.3 3 1.1 0.1 setosa
#> 8 5.7 4.4 1.5 0.4 setosa
#> 9 5.1 3.5 1.4 0.3 setosa
#> 10 5.1 3.8 1.5 0.3 setosa
#> # … with 40 more rows
iris %>%
arrange(Species) %>%
filter(Species == c('versicolor', 'setosa')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 4.9 3 1.4 0.2 setosa
#> 2 4.6 3.1 1.5 0.2 setosa
#> 3 5.4 3.9 1.7 0.4 setosa
#> 4 5 3.4 1.5 0.2 setosa
#> 5 4.9 3.1 1.5 0.1 setosa
#> 6 4.8 3.4 1.6 0.2 setosa
#> 7 4.3 3 1.1 0.1 setosa
#> 8 5.7 4.4 1.5 0.4 setosa
#> 9 5.1 3.5 1.4 0.3 setosa
#> 10 5.1 3.8 1.5 0.3 setosa
#> # … with 40 more rows
iris %>%
arrange(desc(Species)) %>%
filter(Species == c('setosa', 'versicolor')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 6.4 3.2 4.5 1.5 versicolor
#> 2 5.5 2.3 4 1.3 versicolor
#> 3 5.7 2.8 4.5 1.3 versicolor
#> 4 4.9 2.4 3.3 1 versicolor
#> 5 5.2 2.7 3.9 1.4 versicolor
#> 6 5.9 3 4.2 1.5 versicolor
#> 7 6.1 2.9 4.7 1.4 versicolor
#> 8 6.7 3.1 4.4 1.4 versicolor
#> 9 5.8 2.7 4.1 1 versicolor
#> 10 5.6 2.5 3.9 1.1 versicolor
#> # … with 40 more rows
iris %>%
arrange(desc(Species)) %>%
filter(Species == c('versicolor', 'setosa')) %>%
as_tibble()
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.9 3.1 4.9 1.5 versicolor
#> 3 6.5 2.8 4.6 1.5 versicolor
#> 4 6.3 3.3 4.7 1.6 versicolor
#> 5 6.6 2.9 4.6 1.3 versicolor
#> 6 5 2 3.5 1 versicolor
#> 7 6 2.2 4 1 versicolor
#> 8 5.6 2.9 3.6 1.3 versicolor
#> 9 5.6 3 4.5 1.5 versicolor
#> 10 6.2 2.2 4.5 1.5 versicolor
#> # … with 40 more rows
Created on 2022-02-11 by the reprex package (v2.0.0)
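The recycling behind `==` can be made visible with base R alone. This is a small sketch; the names `mask_eq`, `mask_in`, and `recycled` are illustrative and not part of the original answer.

```r
species <- iris$Species
wanted <- c("setosa", "versicolor")

# == compares element by element and recycles the shorter vector
mask_eq <- species == wanted

# %in% asks, for every row, whether its value appears anywhere in `wanted`
mask_in <- species %in% wanted

# Writing the recycling out by hand gives the same mask as ==
recycled <- rep(wanted, length.out = length(species))
same_mask <- all(mask_eq == (as.character(species) == recycled))

c(eq = sum(mask_eq), in_set = sum(mask_in))
```

With `==`, only rows whose position happens to line up with the matching element of `wanted` are kept, which is why the count is halved.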
I am new to the tidyverse. I want to join all columns but one (as the names of the other columns might vary). Here is an example with iris that obviously does not work. Thanks :)
library(tidyverse)
dat <- as_tibble(iris)
dat %>% mutate(New = str_c(!Sepal.Length, sep="_"))
We can use select to select the columns that we want to paste and apply str_c with do.call.
library(tidyverse)
dat %>% mutate(New = do.call(str_c, c(select(., !Sepal.Length), sep="_")))
However, using unite would be simpler.
dat %>% unite(New, !Sepal.Length, sep="_", remove= FALSE)
# Sepal.Length New Sepal.Width Petal.Length Petal.Width Species
# <dbl> <chr> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5_1.4_0.2_setosa 3.5 1.4 0.2 setosa
# 2 4.9 3_1.4_0.2_setosa 3 1.4 0.2 setosa
# 3 4.7 3.2_1.3_0.2_setosa 3.2 1.3 0.2 setosa
# 4 4.6 3.1_1.5_0.2_setosa 3.1 1.5 0.2 setosa
# 5 5 3.6_1.4_0.2_setosa 3.6 1.4 0.2 setosa
# 6 5.4 3.9_1.7_0.4_setosa 3.9 1.7 0.4 setosa
# 7 4.6 3.4_1.4_0.3_setosa 3.4 1.4 0.3 setosa
# 8 5 3.4_1.5_0.2_setosa 3.4 1.5 0.2 setosa
# 9 4.4 2.9_1.4_0.2_setosa 2.9 1.4 0.2 setosa
#10 4.9 3.1_1.5_0.1_setosa 3.1 1.5 0.1 setosa
# … with 140 more rows
Using base R:
dat <- iris
cols <- grepl("Sepal.Length", names(dat))
tmp <- dat[, !cols]
dat$new <- apply(tmp, 1, paste0, collapse = "_")
head(dat)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
#> 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
#> 2 4.9 3.0 1.4 0.2 setosa 3.0_1.4_0.2_setosa
#> 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
#> 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
#> 5 5.0 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
#> 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
Created on 2021-02-01 by the reprex package (v1.0.0)
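A vectorised variant of the base approach is also possible. The sketch below assumes the column to exclude is known by its exact name; `keep` is an illustrative helper variable.

```r
dat <- iris

# Every column name except the one to leave out
keep <- setdiff(names(dat), "Sepal.Length")

# paste() is vectorised over columns, so no row-wise apply() is needed
dat$new <- do.call(paste, c(dat[keep], sep = "_"))

head(dat$new, 2)
```

Unlike apply(), which first converts the data frame to a character matrix with uniformly formatted numbers, paste() converts each value on its own, so 3.0 appears as "3" rather than "3.0".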
We can also use reduce from purrr:
library(dplyr)
library(purrr)
library(stringr)
dat %>%
mutate(New = select(., -Sepal.Length) %>%
reduce(str_c, sep="_"))
# A tibble: 150 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
# <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 5.1 3.5 1.4 0.2 setosa 3.5_1.4_0.2_setosa
# 2 4.9 3 1.4 0.2 setosa 3_1.4_0.2_setosa
# 3 4.7 3.2 1.3 0.2 setosa 3.2_1.3_0.2_setosa
# 4 4.6 3.1 1.5 0.2 setosa 3.1_1.5_0.2_setosa
# 5 5 3.6 1.4 0.2 setosa 3.6_1.4_0.2_setosa
# 6 5.4 3.9 1.7 0.4 setosa 3.9_1.7_0.4_setosa
# 7 4.6 3.4 1.4 0.3 setosa 3.4_1.4_0.3_setosa
# 8 5 3.4 1.5 0.2 setosa 3.4_1.5_0.2_setosa
# 9 4.4 2.9 1.4 0.2 setosa 2.9_1.4_0.2_setosa
#10 4.9 3.1 1.5 0.1 setosa 3.1_1.5_0.1_setosa
# … with 140 more rows
I am using group_split from dplyr, but I need every data frame in the list to keep the name of its group.
Example from the dplyr documentation (notice the data frames are numbered; ideally every data frame would carry the name of its group: setosa, versicolor, ...):
ir <- iris %>%
group_by(Species)
group_split(ir)
#> [[1]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 40 more rows
#>
#> [[2]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.4 3.2 4.5 1.5 versicolor
#> 3 6.9 3.1 4.9 1.5 versicolor
#> 4 5.5 2.3 4 1.3 versicolor
#> 5 6.5 2.8 4.6 1.5 versicolor
#> 6 5.7 2.8 4.5 1.3 versicolor
#> 7 6.3 3.3 4.7 1.6 versicolor
#> 8 4.9 2.4 3.3 1 versicolor
#> 9 6.6 2.9 4.6 1.3 versicolor
#> 10 5.2 2.7 3.9 1.4 versicolor
#> # … with 40 more rows
#>
#> [[3]]
#> # A tibble: 50 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 6.3 3.3 6 2.5 virginica
#> 2 5.8 2.7 5.1 1.9 virginica
#> 3 7.1 3 5.9 2.1 virginica
#> 4 6.3 2.9 5.6 1.8 virginica
#> 5 6.5 3 5.8 2.2 virginica
#> 6 7.6 3 6.6 2.1 virginica
#> 7 4.9 2.5 4.5 1.7 virginica
#> 8 7.3 2.9 6.3 1.8 virginica
#> 9 6.7 2.5 5.8 1.8 virginica
#> 10 7.2 3.6 6.1 2.5 virginica
#> # … with 40 more rows
#>
#> attr(,"ptype")
#> # A tibble: 0 x 5
#> # … with 5 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
#> # Petal.Length <dbl>, Petal.Width <dbl>, Species <fct>
group_split does not preserve names. From ?group_split
it does not name the elements of the list based on the grouping as this typically loses information and is confusing.
You could use base::split for that
split(iris, iris$Species)
Or name the list of tibbles separately using setNames.
library(dplyr)
group_split(ir) %>% setNames(unique(iris$Species))
group_split splits by the factor levels of the data, so if we want to split by order of occurrence in the data, we might have to rearrange the factor levels. In the iris dataset the factor levels are in the same order as they occur in the data, hence the above works.
More generally, we should use:
iris %>%
mutate(Species= factor(Species, levels = unique(Species))) %>%
group_split(Species) %>%
setNames(unique(iris$Species))
We can use set_names from tidyverse
library(tidyverse)
ir %>%
group_split() %>%
set_names(levels(iris$Species))
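Another option that does not depend on the order of the factor levels is to take the names from group_keys(), which describes the groups in the same order that group_split() returns them. A minimal sketch, assuming dplyr 0.8.0 or later; `keys` and `named` are illustrative names.

```r
library(dplyr)

ir <- iris %>% group_by(Species)

# One row per group, in the same order as the pieces from group_split()
keys <- group_keys(ir)

# Attach the group labels as list names
named <- setNames(group_split(ir), as.character(keys$Species))

names(named)
```

Individual groups can then be pulled out by name, for example `named[["setosa"]]`.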
Using the Iris dataframe I can pretty easily pull the first n = 100 records with:
m_data<-iris
m_data[1:100,]
But I am also interested in pulling the first 100 records based on a nice split of the Species. Assume for the moment that the first 100 records are all the same species - I would like to pull the data with a "first sampling" based on the varying Species instead.
Any suggestions are welcome. Thank you.
You can also do this with dplyr, here selecting the first 10 from each species:
library(dplyr)
iris %>%
group_by(Species) %>%
filter(row_number() <= 10) # or slice(1:10)
#> # A tibble: 30 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # ... with 20 more rows
Created on 2018-08-13 by the reprex package (v0.2.0).
Here's an alternative:
do.call(rbind, lapply(split(iris, iris$Species), head, 100))
This pulls up to the first 100 records of each Species from iris (iris has only 50 rows per species, so here every row is returned).
You can use by instead of lapply
do.call(rbind, by(iris, iris$Species, head, 100))
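In recent dplyr versions the same per-group head can be written with slice_head(). This is a sketch that assumes dplyr 1.0.0 or later; `n_per_group` and `first_rows` are illustrative names, and 10 is used instead of 100 so that the effect is visible on iris.

```r
library(dplyr)

n_per_group <- 10

first_rows <- iris %>%
  group_by(Species) %>%
  slice_head(n = n_per_group) %>%  # first n rows within each group
  ungroup()

count(first_rows, Species)
```

Choosing `n_per_group` as the desired total divided by the number of species gives an evenly split sample of the first rows.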