How to select columns based on string using dplyr - r

I can select and rename columns like this without any problem:
library(tidyverse)
iris <- as_tibble(iris)
iris %>% select(sepal_ln = Sepal.Length, sepal_wd = Sepal.Width)
#> # A tibble: 150 × 2
#> sepal_ln sepal_wd
#> <dbl> <dbl>
#> 1 5.1 3.5
#> 2 4.9 3.0
#> 3 4.7 3.2
#> 4 4.6 3.1
#> 5 5.0 3.6
#> 6 5.4 3.9
#> 7 4.6 3.4
#> 8 5.0 3.4
#> 9 4.4 2.9
#> 10 4.9 3.1
#> # ... with 140 more rows
But what I want to do is refer to the columns by a string instead of a bare column name. I tried the following, but it failed:
> wanted <- "Sepal"
> iris %>% select(sepal_ln = !! paste0(wanted,".Length"),
+ sepal_wd = !! paste0(wanted,".Width"),
+ )
Error: "Sepal.Length", "Sepal.Width": must resolve to integer column positions, not string
>
What's the right way to do that?

We can use select_(), which accepts column names as strings:
iris %>%
select_(sepal_ln = paste0(wanted, ".Length"), sepal_wd = paste0(wanted, ".Width"))
Also, select() has helper functions such as one_of(), contains() and matches() that pick the required columns from character vectors or patterns:
iris %>%
select(setNames(one_of(paste0(wanted, c(".Length", ".Width"))),
c("sepal_ln", "sepal_wd"))) %>%
head(2)
# A tibble: 2 × 2
# sepal_ln sepal_wd
# <dbl> <dbl>
#1 5.1 3.5
#2 4.9 3.0
NOTE: It is not clear whether the select_ methods will get deprecated in the next dplyr release (0.6.0) or not.
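On current dplyr releases the underscored verbs such as select_() are deprecated, so a sketch using the tidyselect helper all_of() may be more durable. It assumes dplyr 1.0.0 or later; the result name `res` is only illustrative.

```r
library(dplyr)

wanted <- "Sepal"

# all_of() takes a character vector of column names and errors if a name
# is missing; `new_name = selection` renames while selecting
res <- iris %>%
  select(sepal_ln = all_of(paste0(wanted, ".Length")),
         sepal_wd = all_of(paste0(wanted, ".Width")))

names(res)
```

Wrapping the strings in all_of() makes it explicit that they are column names held in a variable rather than columns of the data.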

Related

How to add new row with concatenated strings for each group? [duplicate]

If I add a new row to the iris dataset with:
iris <- as_tibble(iris)
> iris %>%
add_row(.before=0)
# A tibble: 151 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 NA NA NA NA <NA> <--- Good!
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3.0 1.4 0.2 setosa
It works. So, why can't I add a new row on top of each "subset" with:
iris %>%
group_by(Species) %>%
add_row(.before=0)
Error: is.data.frame(df) is not TRUE
If you want to use a grouped operation, you need do(), as JasonWang described in his comment, because other functions such as mutate() or summarise() expect a result with the same number of rows as the grouped data frame (in your case, 50) or with one row (e.g. when summarising).
As you probably know, do() can be slow in general and should be a last resort when you cannot achieve your result another way. Your task is quite simple because it only involves adding extra rows to your data frame, which can be done by simple indexing; for example, look at the output of iris[NA, ].
What you want is essentially to create a vector
indices <- c(NA, 1:50, NA, 51:100, NA, 101:150)
(since the first group is in rows 1 to 50, the second one in 51 to 100 and the third one in 101 to 150).
The result is then iris[indices, ].
A more general way of building this vector uses group_indices.
indices <- seq(nrow(iris)) %>%
split(group_indices(group_by(iris, Species))) %>%
map(~c(NA, .x)) %>%
unlist()
(map comes from purrr which I assume you have loaded as you have tagged this with tidyverse).
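The same index vector can also be built with base R only. The following is a minimal sketch of that idea, assuming the stock iris data frame; the names `idx` and `res` are illustrative.

```r
# Split the row numbers by group, put an NA in front of each group,
# then flatten back into a single integer vector
idx <- unlist(
  lapply(split(seq_len(nrow(iris)), iris$Species), function(i) c(NA, i)),
  use.names = FALSE
)

# Indexing with NA yields a row of NAs, so each group gains a leading empty row
res <- iris[idx, ]
nrow(res)
```

Because the row numbers are split by the grouping column itself, this does not depend on the groups being stored in contiguous blocks of 50 rows.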
A more recent approach is to use group_modify() instead of do():
iris %>%
as_tibble() %>%
group_by(Species) %>%
group_modify(~ add_row(.x,.before=0))
#> # A tibble: 153 x 5
#> # Groups: Species [3]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa NA NA NA NA
#> 2 setosa 5.1 3.5 1.4 0.2
#> 3 setosa 4.9 3 1.4 0.2
A slight variation uses group_split() and map_dfr():
library(dplyr)
library(purrr)
library(tibble)
iris %>%
group_split(Species) %>%
map_dfr(~ .x %>%
add_row(.before = 1))
# A tibble: 153 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 NA NA NA NA NA
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3 1.4 0.2 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
6 5 3.6 1.4 0.2 setosa
7 5.4 3.9 1.7 0.4 setosa
8 4.6 3.4 1.4 0.3 setosa
9 5 3.4 1.5 0.2 setosa
10 4.4 2.9 1.4 0.2 setosa
# ... with 143 more rows
summarise() can also be used on a grouped data frame, although it is a bit verbose:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = c(NA, Sepal.Length),
Sepal.Width = c(NA, Sepal.Width),
Petal.Length = c(NA, Petal.Length),
Petal.Width = c(NA, Petal.Width),
Species = c(NA, Species))
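On dplyr 1.1.0 or later, reframe() together with across() may express the same idea more compactly. This is a sketch under that version assumption; unlike the summarise() call above, the Species key of each added row is kept rather than set to NA.

```r
library(dplyr)

# reframe() may return any number of rows per group; with `.by`,
# across(everything()) leaves out the grouping column
res <- iris %>%
  reframe(across(everything(), ~ c(NA, .x)), .by = Species)

dim(res)
```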

Create unique random group id in R [duplicate]

I am trying to create a unique, randomly assigned (without replacement) group id without using a for loop. This is as far as I got:
library(datasets)
library(dplyr)
data(iris)
iris <- iris %>% group_by(Species) %>% mutate(id = cur_group_id())
This gives me a group id for each iris$Species; however, I would like the group id to be randomly assigned from c(1,2,3) rather than assigned based on the order of the dataset.
Any help creating this would be very helpful! I am sure there is a way to do this with dplyr but I am stumped...
Maybe you can play a trick on group_by by shuffling the factor levels with sample, e.g.,
iris <- iris %>%
group_by(factor(Species, levels = sample(levels(Species)))) %>%
mutate(id = cur_group_id())
Here's an answer that draws one random number per group and then ranks them:
library(datasets)
library(dplyr)
data(iris)
df <- iris %>%
group_by(Species) %>%
mutate(id = runif(1,0,1)) %>%
ungroup() %>%
mutate(id = dense_rank(id))
df %>% sample_n(10)
#> # A tibble: 10 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 4.4 3 1.3 0.2 setosa 3
#> 2 6.5 3 5.5 1.8 virginica 2
#> 3 6.3 2.7 4.9 1.8 virginica 2
#> 4 5 3.6 1.4 0.2 setosa 3
#> 5 6.3 2.3 4.4 1.3 versicolor 1
#> 6 7.9 3.8 6.4 2 virginica 2
#> 7 5.4 3.9 1.7 0.4 setosa 3
#> 8 5.7 4.4 1.5 0.4 setosa 3
#> 9 6.4 2.8 5.6 2.2 virginica 2
#> 10 5.2 3.4 1.4 0.2 setosa 3
Created on 2020-07-29 by the reprex package (v0.3.0)
Here's an approach with sample and recode:
Use seq_along(unique(id)) to create a vector of integer values to recode to.
Use sample to put those values in random order.
Use setNames to name the new random values with the existing ids.
Use !!! to splice that named vector into recode as individual arguments.
Use recode to change the values.
Build the lookup outside the pipeline, because !!! is evaluated in the calling environment, where the id column is not visible.
ids <- iris %>%
group_by(Species) %>%
mutate(id = cur_group_id()) %>%
ungroup()
lookup <- setNames(sample(seq_along(unique(ids$id))), unique(ids$id))
ids %>%
mutate(id = recode(id, !!!lookup))
I think the other answers are better approaches, but having recode with !!! in your toolkit is helpful in other situations.
Randomise the rows and then assign id based on the first occurrence of each Species:
library(dplyr)
iris %>%
slice_sample(n = nrow(.)) %>%
#sample_n for dplyr < 1.0.0
#sample_n(n()) %>%
mutate(id = match(Species, unique(Species)))
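For comparison, the same result can be sketched in base R by shuffling one id per factor level and looking it up through the level codes. This assumes Species is a factor, as it is in the stock iris data; `random_ids` and `id` are illustrative names.

```r
set.seed(1)  # only to make the sketch reproducible

species <- iris$Species
# one id per level: a random permutation of 1, 2, 3
random_ids <- sample(nlevels(species))
# look up each row's id through the integer code of its factor level
id <- random_ids[as.integer(species)]

table(species, id)
```

Each species ends up with exactly one id, and the ids are assigned without replacement because sample() returns a permutation.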

Pivot wider produces nested object

This concerns the latest tidyr release (1.0.0). I am trying the pivot_wider and pivot_longer functions from library(tidyr).
I expected to get the normal iris dataset back when I run the code below, but instead I get a nested 3 x 5 tibble. I am not sure what's happening (I read https://tidyr.tidyverse.org/articles/pivot.html) and still don't know how to avoid this.
library(tidyr)
iris %>% pivot_longer(-Species,values_to = "count") %>%
pivot_wider(names_from = name, values_from = count)
Expected output: the normal iris dataset (150 x 5).
Edit: I read below that if I wrap the result in unnest() I get the expected output. I don't understand why I need to unnest it when I did not nest it anywhere. Any basic help would be appreciated; I want to understand the concept of what went wrong.
As I learnt from Akrun, other helpful friends and posts, this is not a bug.
spread(., name, count) throws an error because we have multiple rows for each Species x name combination. pivot_wider does a better job by returning list-columns instead. If we add a unique ID to each row, it works fine.
library(tidyverse)
iris %>%
rowid_to_column() %>%
pivot_longer(-c(rowid, Species), values_to = "count") %>%
pivot_wider(names_from = name, values_from = count) %>%
select(-rowid)
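To see why the row id is needed, it may help to count how many rows share each Species/name key in the long data. This is a small diagnostic sketch; `long` and `dupes` are illustrative names.

```r
library(dplyr)
library(tidyr)

long <- iris %>%
  pivot_longer(-Species, values_to = "count")

# number of rows sharing each Species/name key
dupes <- long %>% count(Species, name)
dupes
```

Every key occurs 50 times, so without a row id pivot_wider() has 50 values to place in each cell, which is why it falls back to list-columns.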
pivot_wider(), unlike nest(), allows us to aggregate multiple values when the rows are not given a unique identifier.
The default is to use list to aggregate and to be verbose about it.
To expand the output we could use unnest() as already suggested but it's more idiomatic to use unchop() because we're not trying to expand a horizontal dimensionality in the nested values.
So, to sum it all up, to get back your initial data (except that it will be a tibble), you can do:
library(tidyr)
iris %>%
pivot_longer(-Species,values_to = "count") %>%
print() %>%
pivot_wider(names_from = name,
values_from = count,
values_fn = list(count=list)) %>%
print() %>%
unchop(everything()) %>%
print() %>%
all.equal(iris)
#> # A tibble: 600 x 3
#> Species name count
#> <fct> <chr> <dbl>
#> 1 setosa Sepal.Length 5.1
#> 2 setosa Sepal.Width 3.5
#> 3 setosa Petal.Length 1.4
#> 4 setosa Petal.Width 0.2
#> 5 setosa Sepal.Length 4.9
#> 6 setosa Sepal.Width 3
#> 7 setosa Petal.Length 1.4
#> 8 setosa Petal.Width 0.2
#> 9 setosa Sepal.Length 4.7
#> 10 setosa Sepal.Width 3.2
#> # ... with 590 more rows
#> # A tibble: 3 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <list<dbl>> <list<dbl>> <list<dbl>> <list<dbl>>
#> 1 setosa [50] [50] [50] [50]
#> 2 versicolor [50] [50] [50] [50]
#> 3 virginica [50] [50] [50] [50]
#> # A tibble: 150 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.1 3.5 1.4 0.2
#> 2 setosa 4.9 3 1.4 0.2
#> 3 setosa 4.7 3.2 1.3 0.2
#> 4 setosa 4.6 3.1 1.5 0.2
#> 5 setosa 5 3.6 1.4 0.2
#> 6 setosa 5.4 3.9 1.7 0.4
#> 7 setosa 4.6 3.4 1.4 0.3
#> 8 setosa 5 3.4 1.5 0.2
#> 9 setosa 4.4 2.9 1.4 0.2
#> 10 setosa 4.9 3.1 1.5 0.1
#> # ... with 140 more rows
#> [1] TRUE
Created on 2019-09-15 by the reprex package (v0.3.0)

Is there a way to use a lookup value from a table in a mutate column?

library(tidyverse)
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
mutate(new_col = rank == 4, Sepal.Width)
table <- df %>%
filter(rank == 4) %>%
select(Species, new_col = Sepal.Width)
correct_df <- left_join(df, table, by = "Species")
df
#> # A tibble: 150 x 8
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Dim
#> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 1.4 0.2 setosa 0.280
#> 2 4.9 3 1.4 0.2 setosa 0.280
#> 3 4.7 3.2 1.3 0.2 setosa 0.26
#> 4 4.6 3.1 1.5 0.2 setosa 0.3
#> 5 5 3.6 1.4 0.2 setosa 0.280
#> 6 5.4 3.9 1.7 0.4 setosa 0.68
#> 7 4.6 3.4 1.4 0.3 setosa 0.42
#> 8 5 3.4 1.5 0.2 setosa 0.3
#> 9 4.4 2.9 1.4 0.2 setosa 0.280
#> 10 4.9 3.1 1.5 0.1 setosa 0.15
#> # ... with 140 more rows, and 2 more variables: rank <dbl>, new_col <lgl>
I'm basically looking for new_col to show the value that corresponds with rank = 4 from the Sepal.Width column. In this case, those values would be 3.9, 3.3, and 3.8. I'm envisioning this similar to a VLookup, or Index/Match in Excel.
Whenever I think "now I need to use VLOOKUP like I did in the past in Excel", I find the left_join() function helpful. It is part of the dplyr package. Instead of "looking up" values from one table in another table, it's easier for R to make one bigger table in which one table remains unchanged (here the "left" one, i.e. the first argument you pass to the function) and the other is added, using a column or columns they have in common as an index.
In your specific example, I can't entirely understand what you want new_col to have in it. If you want to do Excel-style VLOOKUP in R, then left_join() is the best starting point.
The question is not clear, since it does not say what the VLOOKUP or INDEX/MATCH-like operation is meant to achieve.
Also, you don't mention what value new_col should have when rank is not equal to 4.
Assuming that value is NA, the solution below with a simple ifelse would work:
df <- iris %>%
group_by(Species) %>%
mutate(Petal.Dim = Petal.Length * Petal.Width,
rank = rank(desc(Petal.Dim))) %>%
ungroup() %>%
mutate(new_col = ifelse(rank == 4, Sepal.Width,NA))
df
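If new_col should instead hold the rank-4 Sepal.Width on every row of the group, as the left_join() in the question produces, a grouped mutate() can do the lookup directly. This is a sketch under that reading of the question; `res` is an illustrative name.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  mutate(Petal.Dim = Petal.Length * Petal.Width,
         rank = rank(desc(Petal.Dim)),
         # match() gives the position of the first row whose rank is 4,
         # or NA when ties leave no row with exactly that rank
         new_col = Sepal.Width[match(4, rank)]) %>%
  ungroup()

res %>% distinct(Species, new_col)
```

Using match() rather than `Sepal.Width[rank == 4]` keeps the result at length one per group even when ties produce zero or several rows with rank 4.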
