Unique values in R using dplyr

starwars %>%
group_by(species,sex) %>%
summarise() %>%
select(unique.species=species, unique.sex=sex)
How do I get the unique combinations of values from two columns ("species" and "sex") together? I wrote the code above, but I'm not sure it's right. Thank you.

library(tidyverse)
starwars |>
select(species, sex) |>
distinct()
#> # A tibble: 41 × 2
#> species sex
#> <chr> <chr>
#> 1 Human male
#> 2 Droid none
#> 3 Human female
#> 4 Wookiee male
#> 5 Rodian male
#> 6 Hutt hermaphroditic
#> 7 Yoda's species male
#> 8 Trandoshan male
#> 9 Mon Calamari male
#> 10 Ewok male
#> # … with 31 more rows
Created on 2022-04-25 by the reprex package (v2.0.1)

library(tidyverse)
starwars %>%
expand(nesting(species, sex))
#> # A tibble: 41 × 2
#> species sex
#> <chr> <chr>
#> 1 Aleena male
#> 2 Besalisk male
#> 3 Cerean male
#> 4 Chagrian male
#> 5 Clawdite female
#> 6 Droid none
#> 7 Dug male
#> 8 Ewok male
#> 9 Geonosian male
#> 10 Gungan male
#> # … with 31 more rows
Created on 2022-04-25 by the reprex package (v2.0.1)

There are multiple options. You can use the following code:
unique(starwars[c("species", "sex")])
Output:
species sex
<chr> <chr>
1 Human male
2 Droid none
3 Human female
4 Wookiee male
5 Rodian male
6 Hutt hermaphroditic
7 Yoda's species male
8 Trandoshan male
9 Mon Calamari male
10 Ewok male
# … with 31 more rows
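A slightly shorter variant of the first answer, sketched on the assumption that dplyr is installed: distinct() accepts column names directly, so the separate select() step can be dropped.

```r
library(dplyr)

# When columns are named, distinct() returns only those columns,
# one row per unique (species, sex) combination
combos <- distinct(starwars, species, sex)
combos
```

Rows come back in order of first appearance, as with select() followed by distinct().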

Related

Creating multiple frequency count tibbles at once in R

I have data on 30 people that includes ethnicity, gender, school type, whether they received free school meals, etc.
I want to produce frequency counts for all of these features. Currently my code looks like this:
df <- read.csv("~file")
df %>% select(Ethnicity) %>% group_by(Ethnicity) %>% summarise(freq = n())
df %>% select(Gender) %>% group_by(Gender) %>% summarise(freq = n())
df %>% select(School.type) %>% group_by(School.type) %>% summarise(freq = n())
Is there a way I can create a frequency tibble for 8 columns (e.g. ethnicity, gender, school type, etc.) in a more efficient way (e.g. 1 or 2 lines of code)?
As an example output for the ethnicity code:
# A tibble: 13 × 2
Ethnicity freq
<chr> <int>
1 Asian or Asian British - Bangladeshi 1
2 Asian or Asian British - Indian 7
3 Asian or Asian British - Pakistani 1
4 Black or Black British - African 5
5 Black or Black British - Caribbean 2
6 Chinese 3
7 Mixed - White and Asian 2
8 Mixed - White and Black African 1
9 Mixed - White and Black Caribbean 1
10 Not known/ prefer not to say 1
11 White British 27
12 White Irish 1
13 White Other 5
And for gender:
# A tibble: 2 × 2
Gender freq
<chr> <int>
1 Female 36
2 Male 21
NB: some columns contain postcodes and names, which I obviously don't want frequency counts for, so I think I'll need to select just the columns I want to count.
One option would be to use lapply to loop over a vector of your desired columns and dplyr::count for the frequency table.
Using the starwars dataset as example data:
library(dplyr, warn = FALSE)
cols <- c("hair_color", "sex")
lapply(cols, function(x) {
count(starwars, .data[[x]], name = "freq")
})
#> [[1]]
#> # A tibble: 13 × 2
#> hair_color freq
#> <chr> <int>
#> 1 auburn 1
#> 2 auburn, grey 1
#> 3 auburn, white 1
#> 4 black 13
#> 5 blond 3
#> 6 blonde 1
#> 7 brown 18
#> 8 brown, grey 1
#> 9 grey 1
#> 10 none 37
#> 11 unknown 1
#> 12 white 4
#> 13 <NA> 5
#>
#> [[2]]
#> # A tibble: 5 × 2
#> sex freq
#> <chr> <int>
#> 1 female 16
#> 2 hermaphroditic 1
#> 3 male 60
#> 4 none 6
#> 5 <NA> 4
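Two small variations on the lapply() approach, offered as a sketch rather than the only way to do it. The first names the list elements so each table can be pulled out by column name; the second stacks the chosen columns into one long tibble. The names freq_list and freq_long are made up for illustration, and the long version assumes the chosen columns share a type (both are character here).

```r
library(dplyr)
library(tidyr)

cols <- c("hair_color", "sex")

# Named list: lapply() keeps the names of its input,
# so setNames() gives one table per column name
freq_list <- lapply(setNames(cols, cols), function(x) {
  count(starwars, .data[[x]], name = "freq")
})
freq_list$sex

# Single long tibble: stack the chosen columns,
# then count each (variable, value) pair
freq_long <- starwars |>
  select(all_of(cols)) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  count(variable, value, name = "freq")
freq_long
```

The long form is handy when the counts are going straight into a plot or a single export; the named list is closer to the separate tibbles asked for in the question.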

How to control the fill_gaps interval in tsibble?

I have two data frames whose gaps are filled at different intervals.
I would like to fill both at the same interval.
Consider two data frames with the same month-day but two years apart:
library(tidyverse)
library(fpp3)
df_2020 <- tibble(month_day = as_date(c('2020-1-1','2020-2-1','2020-3-1')),
amount = c(5, 2, 1))
df_2022 <- tibble(month_day = as_date(c('2022-1-1','2022-2-1','2022-3-1')),
amount = c(5, 2, 1))
These data frames both have three rows, with the same dates, 2 years apart.
Create tsibbles with a yearweek index:
ts_2020 <- df_2020 |> mutate(year_week = yearweek(month_day)) |>
as_tsibble(index = year_week)
ts_2022 <- df_2022 |> mutate(year_week = yearweek(month_day)) |>
as_tsibble(index = year_week)
ts_2020
#> # A tsibble: 3 x 3 [4W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2020-01-01 5 2020 W01
#> 2 2020-02-01 2 2020 W05
#> 3 2020-03-01 1 2020 W09
ts_2022
#> # A tsibble: 3 x 3 [1W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2022-01-01 5 2021 W52
#> 2 2022-02-01 2 2022 W05
#> 3 2022-03-01 1 2022 W09
There are still three rows in each tsibble.
Now fill gaps:
ts_2020_filled <- ts_2020 |> fill_gaps()
ts_2022_filled <- ts_2022 |> fill_gaps()
ts_2020_filled
#> # A tsibble: 3 x 3 [4W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2020-01-01 5 2020 W01
#> 2 2020-02-01 2 2020 W05
#> 3 2020-03-01 1 2020 W09
ts_2022_filled
#> # A tsibble: 10 x 3 [1W]
#> month_day amount year_week
#> <date> <dbl> <week>
#> 1 2022-01-01 5 2021 W52
#> 2 NA NA 2022 W01
#> 3 NA NA 2022 W02
#> 4 NA NA 2022 W03
#> 5 NA NA 2022 W04
#> 6 2022-02-01 2 2022 W05
#> 7 NA NA 2022 W06
#> 8 NA NA 2022 W07
#> 9 NA NA 2022 W08
#> 10 2022-03-01 1 2022 W09
Here is the issue:
ts_2020_filled has 4-weekly steps, and ts_2022_filled has 1-weekly steps.
This is because the two tsibbles have different intervals:
tsibble::interval(ts_2020)
#> <interval[1]>
#> [1] 4W
tsibble::interval(ts_2022)
#> <interval[1]>
#> [1] 1W
The intervals differ because the steps between observations differ:
ts_2020 |>
pluck("year_week") |>
diff()
#> Time differences in weeks
#> [1] 4 4
ts_2022 |>
pluck("year_week") |>
diff()
#> Time differences in weeks
#> [1] 5 4
Therefore, the greatest common divisors are different (4 and 1). From the manual
for as_tsibble:
regular Regular time interval (TRUE) or irregular (FALSE). The
interval is determined by the greatest common divisor of index column,
if TRUE.
Both tsibbles are
regular:
is_regular(ts_2020)
#> [1] TRUE
is_regular(ts_2022)
#> [1] TRUE
So, I would like to set the gap fill interval, so the periods are consistent.
I tried setting .full in fill_gaps and .regular in as_tsibble.
I could not find a way to set the interval of a tsibble.
Is there a way of manually setting the interval used by fill_gaps? Granted, an interval of four weeks won't work for df_2022, but a common interval of one week would work for both.
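For reference, the interval arithmetic described in the question can be reproduced in base R. This is only a sketch of the greatest-common-divisor calculation on the week steps shown above; gcd2 and gcd_all are hypothetical helper names, not tsibble functions.

```r
# Euclid's algorithm for two non-negative integers
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Fold the pairwise GCD over a vector of step sizes
gcd_all <- function(steps) Reduce(gcd2, steps)

gcd_all(c(4, 4))  # steps in ts_2020: gives 4, matching the 4W interval
gcd_all(c(5, 4))  # steps in ts_2022: gives 1, matching the 1W interval
```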

R: Count overall unique objects in column of lists

Ok, so here is my scenario: I have a dataset with a column composed of lists of words (keyword tags for YT videos, where each row is video data).
What I want to do is do a complete count of all unique object instances within these lists, for the entire column. So basically what I want in the end is a table with two fields: keyword, count.
If I just do a simple dplyr query, then it counts the list itself as a unique object. While this is also interesting, this is not what I want.
This is the simple dplyr query I want to build on, but I'm not sure how to count the unique items within the lists:
vid_tag_freq = df %>%
count(tags)
To further clarify:
With a dataset like:
Tags
1 ['Dog', 'Cat', 'Mouse', 'Fish']
2 ['Cat', 'Fish']
3 ['Cat', 'Fish']
I am now getting:
Tags Count
1 ['Dog', 'Cat', 'Mouse', 'Fish'] 1
2 ['Cat', 'Fish'] 2
What I actually want:
Tags Count
1 'Cat' 3
2 'Fish' 3
3 'Dog' 1
4 'Mouse' 1
I hope that explains it lol
EDIT: This is what my data looks like, guess most are lists of lists? Maybe I should clean up [0]s as null?
[1] "[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]"
[2] "[0]"
[3] "[0]"
[4] "[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]"
[5] "[0]"
[6] "[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]"
[7] "[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]"
[8] "[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]"
[9] "[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]"
[10] "[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]"
[11] "[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]"
[12] "[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]"
[13] "[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]"
[14] "[['Blue Jays', 'Ashes', 'friends']]"
[15] "[['Robin Williams (Celebrity)', 'Peter Gzowski']]"
It would help if you could dput() some of the data for a working example. Going off the idea that you have a list column, here are a couple of general solutions you may be able to work with:
df <- tibble::tibble(
x = replicate(10, sample(state.name, sample(5:10, 1), TRUE), simplify = FALSE)
)
df
#> # A tibble: 10 × 1
#> x
#> <list>
#> 1 <chr [7]>
#> 2 <chr [7]>
#> 3 <chr [8]>
#> 4 <chr [6]>
#> 5 <chr [8]>
#> 6 <chr [8]>
#> 7 <chr [8]>
#> 8 <chr [6]>
#> 9 <chr [5]>
#> 10 <chr [10]>
# dplyr in a dataframe
df |>
tidyr::unnest(x) |>
dplyr::count(x)
#> # A tibble: 36 × 2
#> x n
#> <chr> <int>
#> 1 Alabama 1
#> 2 Alaska 1
#> 3 Arkansas 4
#> 4 California 3
#> 5 Colorado 5
#> 6 Connecticut 1
#> 7 Delaware 3
#> 8 Florida 1
#> 9 Georgia 3
#> 10 Hawaii 2
#> # … with 26 more rows
# vctrs
vctrs::vec_count(unlist(df$x))
#> key count
#> 1 Colorado 5
#> 2 Louisiana 5
#> 3 North Dakota 4
#> 4 Mississippi 4
#> 5 Arkansas 4
#> 6 Delaware 3
#> 7 Vermont 3
#> 8 Minnesota 3
#> 9 Utah 3
#> 10 California 3
#> 11 Georgia 3
#> 12 Indiana 2
#> 13 Missouri 2
#> 14 New Hampshire 2
#> 15 Maryland 2
#> 16 Nebraska 2
#> 17 Hawaii 2
#> 18 New Jersey 2
#> 19 Oklahoma 2
#> 20 Massachusetts 1
#> 21 Illinois 1
#> 22 Texas 1
#> 23 Connecticut 1
#> 24 Rhode Island 1
#> 25 Michigan 1
#> 26 New York 1
#> 27 Ohio 1
#> 28 Nevada 1
#> 29 Florida 1
#> 30 Montana 1
#> 31 Wisconsin 1
#> 32 Alabama 1
#> 33 Alaska 1
#> 34 North Carolina 1
#> 35 Washington 1
#> 36 Kansas 1
Created on 2022-10-07 with reprex v2.0.2
Edit
If your list is actually a character vector, you'll need to do some string parsing.
# "list" but are actually strings
x <- c(
"[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]",
"[0]",
"[0]",
"[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]",
"[0]",
"[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]",
"[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]",
"[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]",
"[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]",
"[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]",
"[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]",
"[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]",
"[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]",
"[['Blue Jays', 'Ashes', 'friends']]",
"[['Robin Williams (Celebrity)', 'Peter Gzowski']]"
)
# assign to a data.frame
df <- data.frame(x = x)
df |>
dplyr::mutate(
# remove square brackets at beginning or end
x = gsub("^\\[{1,2}|\\]{1,2}$", "", x),
# separate the strings into an actual list
x = strsplit(x, "',\\s|,\\s'")
) |>
# unnest the list column so each element appears as its own row
tidyr::unnest(x) |>
# some extra cleaning to strip out the '
dplyr::mutate(x = gsub("^'|'$", "", x)) |>
# count the individual elements
dplyr::count(x, sort = TRUE)
#> # A tibble: 47 × 2
#> x n
#> <chr> <int>
#> 1 0 3
#> 2 CBC Television (TV Network) 3
#> 3 Canadian Broadcasting Corporation (TV Network) 2
#> 4 Documentary (TV Genre) 2
#> 5 Rob Ford (Politician) 2
#> 6 the fifth estate 2
#> 7 Amanda Todd 1
#> 8 Ashes 1
#> 9 B.C. 1
#> 10 Bill Blair 1
#> # … with 37 more rows
# same result just working with the vector
x |>
gsub("^\\[{1,2}|\\]{1,2}$", "", x = _) |>
strsplit("',\\s|,\\s'") |>
unlist() |>
gsub("^'|'$", "", x = _) |>
vctrs::vec_count() # or table()
#> key count
#> 1 CBC Television (TV Network) 3
#> 2 0 3
#> 3 Rob Ford (Politician) 2
#> 4 the fifth estate 2
#> 5 Documentary (TV Genre) 2
#> 6 Canadian Broadcasting Corporation (TV Network) 2
#> 7 City Hall (Building) 1
#> 8 United States Of America (Country) 1
#> 9 Mark Kelley 1
#> 10 Israel (Country) 1
#> 11 Bill Blair 1
#> 12 Interview (TV Genre) 1
#> 13 Blue Jays 1
#> 14 Hockey (Sport) 1
#> 15 friends 1
#> 16 Peter Gzowski 1
#> 17 Suicide Of Amanda Todd (Event) 1
#> 18 greenhouse gas 1
#> 19 Dog Walking (Profession) 1
#> 20 Flood (Disaster Type) 1
#> 21 National Hockey League (Sports Association) 1
#> 22 Amanda Todd 1
#> 23 Chair 1
#> 24 dog walker 1
#> 25 bomb threat 1
#> 26 dogs 1
#> 27 Climate Change (Website Category) 1
#> 28 Robin Williams (Celebrity) 1
#> 29 Margaret Chan (Physician) 1
#> 30 cyberbullying 1
#> 31 Ashes 1
#> 32 Ontario (City/Town/Village) 1
#> 33 Iraq (Country) 1
#> 34 WHO 1
#> 35 cappers 1
#> 36 Gillian Findlay 1
#> 37 Military (Film Genre) 1
#> 38 CBC News (Website Owner) 1
#> 39 B.C. 1
#> 40 Ebola (Disease Or Medical Condition) 1
#> 41 Toronto (City/Town/Village) 1
#> 42 death 1
#> 43 emissions 1
#> 44 Rocket (Product Category) 1
#> 45 Gaza War (Military Conflict) 1
#> 46 dead 1
#> 47 Burlington (City/Town/Village) 1
Created on 2022-10-08 with reprex v2.0.2
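On the question of the "[0]" placeholders: one option is to drop them before parsing, so that 0 does not show up as a tag in the counts. The sketch below uses a different technique from the answer above: instead of splitting on separators it extracts every single-quoted run with regmatches(). It uses a shortened made-up vector and assumes no tag itself contains an apostrophe.

```r
# Shortened illustrative input in the same shape as the real column
x <- c("[['Cat', 'Fish']]", "[0]", "[['Dog', 'Cat']]", "[0]")

# Drop the placeholder rows that carry no tags
x <- x[x != "[0]"]

# Pull out every 'quoted' run, then strip the surrounding quotes
tags <- unlist(regmatches(x, gregexpr("'[^']*'", x)))
tags <- gsub("^'|'$", "", tags)

tag_counts <- table(tags)
tag_counts
```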
It looks like you need unnest_longer():
library(dplyr)
library(tidyr)
df <- tibble(
Tags = list(
list('Dog', 'Cat', 'Mouse', 'Fish'),
list('Cat', 'Fish'),
list('Cat', 'Fish')
)
)
df %>%
tidyr::unnest_longer(Tags) %>%
count(Tags) %>%
arrange(desc(n))
#> # A tibble: 4 × 2
#> Tags n
#> <chr> <int>
#> 1 Cat 3
#> 2 Fish 3
#> 3 Dog 1
#> 4 Mouse 1
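If a true list column is all you have and you don't need a tibble back, base R can do the same count without any packages. A minimal sketch using the example data from the question:

```r
tags <- list(
  c("Dog", "Cat", "Mouse", "Fish"),
  c("Cat", "Fish"),
  c("Cat", "Fish")
)

# Flatten the list into one character vector, tabulate, and sort by frequency
counts <- sort(table(unlist(tags)), decreasing = TRUE)

# as.data.frame() on a one-dimensional table gives columns Var1 and Freq
as.data.frame(counts, stringsAsFactors = FALSE)
```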

Show me a better way! How to unnest a heavily nested list in R

I will start off by stating that I have working code, but it is embarrassingly inefficient and clumsy. I was hoping that someone in the community might be able to show me a better way to unnest this heavily nested list.
As background, this is heavily nested transaction data on NFTs. I am just trying to get a data frame out, ultimately down to the daily level. I have managed to get the code working for the totalPriceUSD field, but as I mentioned, it is clumsy.
library(dplyr)
library(tidyr)
library(rlist)
library(jsonlite)
mydata <- fromJSON("https://api2.cryptoslam.io/api/nft-indexes/NFTGlobal")
#attempt at nested extraction
mydata <- rlist::list.flatten(mydata) %>% dplyr::bind_rows()
mydata <- select(mydata, contains("totalPriceUSD"))
mydata <- select(mydata, contains("daily"))
#change row name
rownames(mydata) <- "totalPriceUSD"
names(mydata) <- substring(names(mydata),24,33)
#change col names
names(mydata) <- format(as.Date(names(mydata), format = "%Y-%m-%d"))
mydata1 <- mydata %>%
gather(date, totalPriceUSD)
mydata <- as.data.frame(mydata1)
mydata$date <- as.Date(mydata$date, format = "%Y-%m-%d")
As I said, it works, but it ain't pretty. Any suggestions on improving this?
Many thanks
library(dplyr)
mydata <- jsonlite::fromJSON("https://api2.cryptoslam.io/api/nft-indexes/NFTGlobal")
monthly <- bind_rows(lapply(mydata, `[[`, "monthlySummary"), .id = "monthly_id")
daily <- bind_rows(lapply(mydata, function(z) bind_rows(z[["dailySummaries"]], .id = "daily_id")), .id = "monthly_id")
monthly
# # A tibble: 60 x 6
# monthly_id totalTransactions uniqueBuyers uniqueSellers totalPriceUSD isRollingHoursData
# <chr> <int> <int> <int> <dbl> <lgl>
# 1 2017-06 193 33 32 11570. FALSE
# 2 2017-07 613 61 57 89111. FALSE
# 3 2017-08 113 36 31 15133. FALSE
# 4 2017-09 63 22 19 5154. FALSE
# 5 2017-10 52 17 11 3041. FALSE
# 6 2017-11 7259 1077 508 72760. FALSE
# 7 2017-12 265412 53406 23137 18804813. FALSE
# 8 2018-01 30693 7682 4582 1360558. FALSE
# 9 2018-02 34177 4142 4364 2931369. FALSE
# 10 2018-03 29051 3752 2784 987256. FALSE
# # ... with 50 more rows
daily
# # A tibble: 1,750 x 7
# monthly_id daily_id totalTransactions uniqueBuyers uniqueSellers totalPriceUSD isRollingHoursData
# <chr> <chr> <int> <int> <int> <dbl> <lgl>
# 1 2017-06 2017-06-23T00:00:00 27 9 6 1456. FALSE
# 2 2017-06 2017-06-24T00:00:00 15 7 8 846. FALSE
# 3 2017-06 2017-06-25T00:00:00 15 7 5 594. FALSE
# 4 2017-06 2017-06-26T00:00:00 23 10 12 1076. FALSE
# 5 2017-06 2017-06-27T00:00:00 35 8 15 2091. FALSE
# 6 2017-06 2017-06-28T00:00:00 15 6 5 1431. FALSE
# 7 2017-06 2017-06-29T00:00:00 41 13 11 2302. FALSE
# 8 2017-06 2017-06-30T00:00:00 22 11 7 1775. FALSE
# 9 2017-07 2017-07-01T00:00:00 12 7 10 3727. FALSE
# 10 2017-07 2017-07-02T00:00:00 34 13 12 3117. FALSE
# # ... with 1,740 more rows
An alternative to @r2evans's answer using rrapply() + unnest_wider(). This should generalize to arbitrary levels of nesting as well.
library(tidyr)
library(jsonlite)
library(rrapply)
mydata <- fromJSON("https://api2.cryptoslam.io/api/nft-indexes/NFTGlobal")
monthly <- rrapply(mydata, classes = "list", condition = \(x, .xname) .xname == "monthlySummary", how = "melt") |>
unnest_wider(value)
daily <- rrapply(mydata, classes = "list", condition = \(x, .xparents) "dailySummaries" %in% head(.xparents, -1), how = "melt") |>
unnest_wider(value)
monthly
#> # A tibble: 60 × 9
#> L1 L2 totalTransactio… uniqueBuyers uniqueSellers totalPriceUSD
#> <chr> <chr> <int> <int> <int> <dbl>
#> 1 2017-06 monthlySum… 193 33 32 11570.
#> 2 2017-07 monthlySum… 613 61 57 89111.
#> 3 2017-08 monthlySum… 113 36 31 15133.
#> 4 2017-09 monthlySum… 63 22 19 5154.
#> 5 2017-10 monthlySum… 52 17 11 3041.
#> 6 2017-11 monthlySum… 7259 1077 508 72760.
#> 7 2017-12 monthlySum… 265412 53406 23137 18804813.
#> 8 2018-01 monthlySum… 30693 7682 4582 1360558.
#> 9 2018-02 monthlySum… 34177 4142 4364 2931369.
#> 10 2018-03 monthlySum… 29051 3752 2784 987256.
#> # … with 50 more rows, and 3 more variables: isRollingHoursData <lgl>,
#> # productNames <lgl>, productNamesWithoutAnySale <lgl>
daily
#> # A tibble: 1,750 × 10
#> L1 L2 L3 totalTransactio… uniqueBuyers uniqueSellers totalPriceUSD
#> <chr> <chr> <chr> <int> <int> <int> <dbl>
#> 1 2017-06 dail… 2017… 27 9 6 1456.
#> 2 2017-06 dail… 2017… 15 7 8 846.
#> 3 2017-06 dail… 2017… 15 7 5 594.
#> 4 2017-06 dail… 2017… 23 10 12 1076.
#> 5 2017-06 dail… 2017… 35 8 15 2091.
#> 6 2017-06 dail… 2017… 15 6 5 1431.
#> 7 2017-06 dail… 2017… 41 13 11 2302.
#> 8 2017-06 dail… 2017… 22 11 7 1775.
#> 9 2017-07 dail… 2017… 12 7 10 3727.
#> 10 2017-07 dail… 2017… 34 13 12 3117.
#> # … with 1,740 more rows, and 3 more variables: isRollingHoursData <lgl>,
#> # productNames <lgl>, productNamesWithoutAnySale <lgl>
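To see how the bind_rows() + lapply() pattern works without calling the API, here is a sketch on a tiny hand-built list. The shape (month, then monthlySummary and dailySummaries keyed by timestamp) is assumed from the output above, and the numbers are made up. The last line converts the timestamp id into a Date, which is the daily level the question was after.

```r
library(dplyr)

# Toy stand-in for the parsed JSON; values are illustrative only
toy <- list(
  "2017-06" = list(
    monthlySummary = list(totalTransactions = 3L, totalPriceUSD = 30),
    dailySummaries = list(
      "2017-06-23T00:00:00" = list(totalTransactions = 1L, totalPriceUSD = 10),
      "2017-06-24T00:00:00" = list(totalTransactions = 2L, totalPriceUSD = 20)
    )
  ),
  "2017-07" = list(
    monthlySummary = list(totalTransactions = 5L, totalPriceUSD = 50),
    dailySummaries = list(
      "2017-07-01T00:00:00" = list(totalTransactions = 5L, totalPriceUSD = 50)
    )
  )
)

# One row per month; the list names become the monthly_id column
monthly <- bind_rows(lapply(toy, `[[`, "monthlySummary"), .id = "monthly_id")

# One row per day; inner names become daily_id, outer names monthly_id
daily <- bind_rows(
  lapply(toy, function(z) bind_rows(z[["dailySummaries"]], .id = "daily_id")),
  .id = "monthly_id"
)

# The first 10 characters of the timestamp are the calendar date
daily$date <- as.Date(substr(daily$daily_id, 1, 10))
daily
```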

`dplyr::select` without reordering columns

I am looking for an easy, concise way to use dplyr::select without rearranging columns.
Consider this dataset:
library(tidyverse)
head(msleep)
#> # A tibble: 6 × 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
#> 2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
#> 3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
#> 4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
#> 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
#> 6 Three-… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
#> # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
If I select vore, genus and name, the resulting dataframe is arranged in the order in which the columns were provided.
msleep %>% select(vore, genus, name)
#> # A tibble: 83 × 3
#> vore genus name
#> <chr> <chr> <chr>
#> 1 carni Acinonyx Cheetah
#> 2 omni Aotus Owl monkey
#> 3 herbi Aplodontia Mountain beaver
#> 4 omni Blarina Greater short-tailed shrew
#> 5 herbi Bos Cow
#> 6 herbi Bradypus Three-toed sloth
#> 7 carni Callorhinus Northern fur seal
#> 8 <NA> Calomys Vesper mouse
#> 9 carni Canis Dog
#> 10 herbi Capreolus Roe deer
#> # … with 73 more rows
I would instead like to leave them in their default order: name, genus, then vore.
I have a solution (see below), but I do not like it because it is quite wordy, and not completely “tidyverse-esque”.
(I am teaching an intro to tidyverse course, and would like something that would not intimidate beginners.)
msleep %>%
select(all_of(names(msleep)[names(msleep) %in% c("vore", "genus", "name")]))
#> # A tibble: 83 × 3
#> name genus vore
#> <chr> <chr> <chr>
#> 1 Cheetah Acinonyx carni
#> 2 Owl monkey Aotus omni
#> 3 Mountain beaver Aplodontia herbi
#> 4 Greater short-tailed shrew Blarina omni
#> 5 Cow Bos herbi
#> 6 Three-toed sloth Bradypus herbi
#> 7 Northern fur seal Callorhinus carni
#> 8 Vesper mouse Calomys <NA>
#> 9 Dog Canis carni
#> 10 Roe deer Capreolus herbi
#> # … with 73 more rows
Is there such a thing? Thank you!
For context: In reality, we have a data frame with about 400 columns, from which we are selecting ~10-20 at a time to work with. The order of the columns in the original data frame is meaningful, but we don't want to have to labor over listing them in their correct order in the select statements. A very specific need, I'll admit.
Created on 2021-12-22 by the reprex package (v2.0.1)
We could use match with sort
library(dplyr)
msleep %>%
select(sort(match(c("vore", "genus", "name"), names(.))))
EDIT: Based on the OP's comments
Update:
In case of providing a vector we could do as akrun suggests in the comments:
nm1 <- c("vore", "genus", "name"); pattern <- str_c(nm1, collapse="|")
Original answer:
You could first define a string with the search items
and then use matches
pattern <- c("vore|genus|name")
select(msleep, matches(pattern))
name genus vore
<chr> <chr> <chr>
1 Cheetah Acinonyx carni
2 Owl monkey Aotus omni
3 Mountain beaver Aplodontia herbi
4 Greater short-tailed shrew Blarina omni
5 Cow Bos herbi
6 Three-toed sloth Bradypus herbi
7 Northern fur seal Callorhinus carni
8 Vesper mouse Calomys NA
9 Dog Canis carni
10 Roe deer Capreolus herbi
You can use the power of eval_select() to create a function to select and sort the columns.
library(dplyr)
select_in_order <- function(data, ...) {
ordered_cols <- sort(tidyselect::eval_select(expr(c(...)), data))
select(data, all_of(ordered_cols))
}
So now this will do what you are asking. The benefit is that it accepts everything you are used to passing to a select() statement.
# library(ggplot2) # msleep is in ggplot2
msleep %>%
select_in_order(vore, genus, name)
# this will work as well
msleep %>%
select_in_order(starts_with("sleep"), vore, name:genus)
EDIT
As another option, simply use relocate() after your select() statement. This alternative approach accomplishes your end goal of keeping the columns in order in a way that is easy to understand by a beginner.
msleep %>%
select(vore, genus, name) %>%
relocate(any_of(names(msleep)))
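One more compact option, sketched on a small made-up tibble so the column order is easy to check: intersect() returns its result in the order of its first argument, so intersecting the data frame's names with the wanted names keeps the original order. The tibble df and the vector wanted are illustrative names, not part of the question.

```r
library(dplyr)

# Tiny stand-in with the same leading columns as msleep
df <- tibble(
  name = "Cheetah", genus = "Acinonyx", vore = "carni", order = "Carnivora"
)

wanted <- c("vore", "genus", "name")

# intersect() keeps the order of names(df), not of wanted
kept <- select(df, all_of(intersect(names(df), wanted)))
names(kept)
```

On the real data the same line would read select(msleep, all_of(intersect(names(msleep), wanted))).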
