Can I use Boolean operators with R tidyselect functions?

Is there a way to use Boolean operators (e.g. | or &) with the tidyselect helper functions to select variables?
The code below illustrates what currently works and what, in my mind, should work but doesn't.
df<-sample(seq(1,4,1), replace=TRUE, size=400)
df<-data.frame(matrix(df, ncol=10))
#make variable names
library(tidyverse)
library(stringr)
vars1<-str_c('q1_', seq(1,5,1))
vars2<-str_c('q9_', seq(1,5,1))
#Assign
names(df)<-c(vars1, vars2)
names(df)
#This works
df %>%
select(starts_with('q1_'), starts_with('q9'))
#This does not work using |
df %>%
select(starts_with('q1_'| 'q9_'))
#This does not work with c()
df %>%
select(starts_with(c('q1_', 'q9_')))

You can use multiple starts_with, e.g.,
df %>% select(starts_with('q1_'), starts_with('q9_'))
You can use | in a regular expression and matches() (in this case, in combination with ^, the regex beginning-of-string)
df %>% select(matches('^q1_|^q9_'))
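On the original question about | and &: as far as I know, tidyselect 1.0.0 and later (the version used by dplyr 1.0.0 and later) lets you combine selection helpers with Boolean operators and also accepts a character vector of prefixes. The sketch below assumes such a version is installed; on older versions these calls are expected to fail as described in the question. The data frame is rebuilt here only so that the example is self-contained.

```r
library(dplyr)

# Rebuild a data frame with the same column names as in the question.
set.seed(1)
df <- data.frame(matrix(sample(1:4, 400, replace = TRUE), ncol = 10))
names(df) <- c(paste0("q1_", 1:5), paste0("q9_", 1:5))

# Union: columns that start with either prefix.
either <- df %>% select(starts_with("q1_") | starts_with("q9_"))

# Same union, passing both prefixes to a single helper.
either_vec <- df %>% select(starts_with(c("q1_", "q9_")))

# Intersection: columns that satisfy both conditions.
both <- df %>% select(starts_with("q1_") & ends_with("_3"))

# Negation: everything except the q9_ columns.
not_q9 <- df %>% select(!starts_with("q9_"))

names(both)
names(not_q9)
```

Note that the operators go between helper calls, not between the strings inside one helper, which is why starts_with('q1_' | 'q9_') fails.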

You can also approach it using purrr:
map(.x = c("q1_", "q9_"), ~ df %>%
select(starts_with(.x))) %>%
bind_cols()
q1_1 q1_2 q1_3 q1_4 q1_5 q9_1 q9_2 q9_3 q9_4 q9_5
1 2 4 3 1 2 2 3 1 1 3
2 1 3 3 4 4 3 2 2 1 3
3 2 2 3 4 3 4 1 3 2 4
4 1 2 4 2 4 3 3 1 3 3
5 3 1 2 3 3 2 2 3 3 3
6 4 2 3 4 1 4 2 4 2 4
7 3 1 4 1 4 2 4 4 1 2
8 2 2 3 2 1 3 3 3 1 4
9 1 4 2 3 4 4 1 1 3 4
10 1 1 2 4 1 1 4 4 1 2

Related

Using "contain" function with two arguments in R

I have a dataset like this, for example:
dat1 <- read.table(header=TRUE, text="
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")
I'd like to use the select function to gather the variables that contain Trust AND T1.
dat1 <- dat1 %>%
mutate(Trust_T1 = select(., contains("Trust")))
Does anybody know how to use two arguments there, so that I get the variables that contain Trust AND T1? If I use:
dat1 <- dat1 %>%
mutate(Trust_T1 = select(., contains("Trust"), contains("T1")))
it gives me the variables that contain EITHER Trust OR T1.
best!
If we need both, use matches() with a regex that specifies column names that start (^) with 'Trust' and end ($) with 'T1' (assuming these are the only patterns):
library(dplyr)
dat1 %>%
select(matches("^Trust_.*T1$"))
It is not clear what the mutate is meant to create, as multiple columns match 'Trust' followed by 'T1'. If the intention is to perform some operation on the selected columns, either across or c_across with rowwise can be used (not clear from the post).
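As a minimal sketch of what such a row-wise operation might look like: the post does not say which operation is wanted, so the row mean used here is purely an assumption, and the column name mean_trust_t1 is made up for illustration. The first variant selects the matching columns and applies rowMeans; the second uses rowwise with c_across.

```r
library(dplyr)

dat1 <- read.table(header = TRUE, text = "
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")

# Vectorised variant: pick the Trust/T1 columns, then average across each row.
trust_t1 <- select(dat1, matches("^Trust_.*T1$"))
by_rowmeans <- mutate(dat1, mean_trust_t1 = unname(rowMeans(trust_t1)))

# Row-wise variant: c_across() collects the matching columns for each row.
by_rowwise <- dat1 %>%
  rowwise() %>%
  mutate(mean_trust_t1 = mean(c_across(matches("^Trust_.*T1$")))) %>%
  ungroup()

by_rowmeans$mean_trust_t1
```

The new column name deliberately does not start with 'Trust', so it cannot be picked up by the same regex if the selection is run again.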
One solution could be to combine the helpers with &, which keeps only the columns that satisfy both conditions (| would return their union instead):
library(dplyr)
df %>% select(starts_with('Trust') & contains('_T1'))
#>   Trust_01_T1 Trust_02_T1 Trust_03_T1
#> 1           5           1           2
#> 2           3           1           3
#> 3           2           1           3
#> 4           4           2           5
#> 5           5           1           4
DATA
df <- read.table(text =
"
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
", header =T)

Changing specific element in various variables

I've got variables in a dataset like this:
dat1 <- read.table(header=TRUE, text="
comp_T1_01 comp_T1_02 comp_T1_03 res_T1_01 res_T1_02 res_T1_03 res_T1_04
1 1 2 1 5 5 5
2 1 3 3 4 4 1
3 1 3 1 3 2 2
4 2 5 5 3 2 2
5 1 4 1 2 1 3
")
I would like to erase the "T1" from all the variable names at once. As I have over 100 variables, doing it by hand with "colnames" would be a bit too complicated.
Is there a command that can do that?
Thank you!
You can use sub :
names(dat1) <- sub('_T1', '', names(dat1))
dat1
# comp_01 comp_02 comp_03 res_01 res_02 res_03 res_04
#1 1 1 2 1 5 5 5
#2 2 1 3 3 4 4 1
#3 3 1 3 1 3 2 2
#4 4 2 5 5 3 2 2
#5 5 1 4 1 2 1 3
In dplyr, you can use rename_with :
library(dplyr)
dat1 %>% rename_with(~sub('_T1', '', .))
We can use str_remove
library(dplyr)
library(stringr)
dat1 %>%
rename_all(~ str_remove(., '_T1'))
This may also work:
names(dat1) <- gsub(x = names(dat1), pattern = "\\_T1", replacement = "")
dat1
comp_01 comp_02 comp_03 res_01 res_02 res_03 res_04
1 1 1 2 1 5 5 5
2 2 1 3 3 4 4 1
3 3 1 3 1 3 2 2
4 4 2 5 5 3 2 2
5 5 1 4 1 2 1 3
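One detail the answers above leave implicit is that sub replaces only the first match in each name, while gsub replaces every match. The sketch below illustrates the difference on a made-up name (comp_T1_T1_01 is hypothetical, not from the question) and shows rename_with restricted to the affected columns through its .cols argument.

```r
library(dplyr)

dat1 <- read.table(header = TRUE, text = "
comp_T1_01 comp_T1_02 comp_T1_03 res_T1_01 res_T1_02 res_T1_03 res_T1_04
1 1 2 1 5 5 5
2 1 3 3 4 4 1
3 1 3 1 3 2 2
4 2 5 5 3 2 2
5 1 4 1 2 1 3
")

# Rename only the columns that contain "_T1"; fixed = TRUE treats the
# pattern as a literal string rather than a regular expression.
renamed <- dat1 %>%
  rename_with(~ sub("_T1", "", .x, fixed = TRUE), .cols = contains("_T1"))
names(renamed)

# sub() removes the first occurrence only, gsub() removes all of them.
first_only <- sub("_T1", "", "comp_T1_T1_01", fixed = TRUE)
all_hits   <- gsub("_T1", "", "comp_T1_T1_01", fixed = TRUE)
```

For the names in the question each column contains "_T1" exactly once, so sub and gsub give the same result.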

Convert an integer from a scraped string in a database in r

I am struggling to find a way to convert a string that contains both numbers and letters into just a number in R. I web-scraped some data and now want to convert one column from a string into a number. The last column of my df, Clean.data$Drafted..tm.rnd.yr., currently reads like "Arizona / 1st / 5th pick / 2011". I am trying to extract the pick number, so for that example I would want to extract just "5". Is there any way to do this? I am fairly new to R.
library(rvest)
library(magrittr)
library(dplyr)
library(purrr)
years <- 2010:2020
urls <- paste0(
'https://www.pro-football-reference.com/draft/',
years,
'-combine.htm')
combine.data <- map(
urls,
~read_html(.x) %>%
html_nodes(".stats_table") %>%
html_table() %>%
as.data.frame()
) %>%
set_names(years) %>%
bind_rows(.id = "year") %>%
filter(Pos == 'CB' | Pos == "S")
Clean.data <- combine.data[!rowSums(combine.data == "")> 0,]
This is my code so far.
You can use regex to extract the relevant number from the data.
Clean.data$pick_number <- as.integer(sub('.*?/\\s(\\d+).*', '\\1',
Clean.data$Drafted..tm.rnd.yr.))
Clean.data$pick_number
# [1] 5 2 5 3 1 1 4 1 5 3 3 4 1 4 3 5 3 2 2 4 3 1 5 1 5 7 2
# [28] 5 3 7 1 2 3 4 7 7 2 3 3 5 3 5 7 3 2 2 5 3 5 4 4 6 1 3
# [55] 6 7 6 4 2 4 3 2 6 5 2 3 5 3 1 2 2 4 3 1 3 6 4 6 2 2 2
# [82] 4 1 6 3 3 4 5 2 1 3 3 7 3 1 2 1 4 4 5 3 1 2 4 3 2 7 3
#[109] 3 4 5 2 4 5 1 7 2 6 5 4 2 6 4 4 5 4
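This extracts the digits after the first "/". For a string like "Arizona / 1st / 5th pick / 2011", that is the round (1) rather than the pick (5).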
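If the pick itself is wanted, as the question asks, one possible approach is to anchor the pattern on the word "pick". This is only a sketch: it assumes every string follows the "team / round / Nth pick / year" layout from the question, and apart from the first string the sample values below are made up for illustration.

```r
# Sample strings in the layout described in the question. Only the first
# one comes from the question; the others are invented test values.
drafted <- c(
  "Arizona / 1st / 5th pick / 2011",
  "Team B / 2nd / 33rd pick / 2010",
  "Team C / 5th / 154th pick / 2012"
)

# Capture the digits that are followed by a two-letter ordinal suffix
# (st, nd, rd, th) and the word "pick".
pick_number <- as.integer(
  sub(".*/\\s*(\\d+)[a-z]{2}\\s+pick.*", "\\1", drafted)
)
pick_number
```

Strings that do not match the pattern are returned unchanged by sub, so as.integer would turn them into NA with a warning; that may be worth checking for on the real scraped column.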

Using loops with mutate in R to sum columns with partially matching column names

df <- data.frame(x_1_jr=c(1,2,3,4), x_2_jr=c(1,2,3,4), y_1_jr=c(4,3,2,1), y_2_jr=c(4,3,2,1))
x_1_jr x_2_jr y_1_jr y_2_jr
1 1 1 4 4
2 2 2 3 3
3 3 3 2 2
4 4 4 1 1
I am trying to generate new variables that are the sum of x and y with the same column name suffix, i.e.
df <- df %>% mutate(z_1_jr= x_1_jr + y_1_jr)
x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr
1 1 1 4 4 5
2 2 2 3 3 5
3 3 3 2 2 5
4 4 4 1 1 5
I could write this out for each variable combination, but I have a large number of variables(>50 for each x and y group), and would like to use a loop... however, I'm relatively new to R and am not sure where to begin!
Can someone help? Thank you!
EDIT: for additional clarity, the dataset contains other non-numeric variables. There are >700 columns (from a large survey). x_1_jr represents, for example, the number of male individuals ages 1 year, y_1_jr female individuals of 1 year. I am trying to get a total (male plus female of 1 year) for each age group.
An option with base R
df[c("z_1_jr", "z_2_jr")] <- sapply(split.default(df,
sub("^[a-z]+_", "", names(df))), rowSums)
df
# x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr z_2_jr
#1 1 1 4 4 5 5
#2 2 2 3 3 5 5
#3 3 3 2 2 5 5
#4 4 4 1 1 5 5
One dplyr and purrr option could be:
df %>%
bind_cols(map_dfc(.x = unique(sub(".*?_", "_", names(df))),
~ df %>%
transmute(!!paste0("z", .x) := rowSums(select(., ends_with(.x))))))
x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr z_2_jr
1 1 1 4 4 5 5
2 2 2 3 3 5 5
3 3 3 2 2 5 5
4 4 4 1 1 5 5
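Since the edit to the question mentions other, non-numeric columns in the data, here is a hedged base R sketch that builds the column names explicitly instead of splitting the whole data frame. It assumes every x_ column has a y_ partner with the same suffix; the id column is added here only to stand in for the unrelated variables.

```r
# Example data with one unrelated, non-numeric column (id is hypothetical).
df <- data.frame(
  id     = c("a", "b", "c", "d"),
  x_1_jr = c(1, 2, 3, 4), x_2_jr = c(1, 2, 3, 4),
  y_1_jr = c(4, 3, 2, 1), y_2_jr = c(4, 3, 2, 1)
)

# Derive the matching y_ and z_ names from the x_ names.
x_cols <- grep("^x_", names(df), value = TRUE)
y_cols <- sub("^x_", "y_", x_cols)
z_cols <- sub("^x_", "z_", x_cols)

# Stop early if some x_ column has no y_ partner.
stopifnot(all(y_cols %in% names(df)))

# Adding two data frames of equal shape works element by element,
# so all sums are computed in a single step without a loop.
df[z_cols] <- df[x_cols] + df[y_cols]
df
```

Because the column pairs are matched by name, the order of the columns in the data frame does not matter.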

In R: How to coerce a list of vectors with unequal length to a dataframe using tidyverse?

Suppose you have the following list in R:
list_test <- list(c(2,4,5, 6), c(1,2,3), c(7,8))
What I am looking for is a dataframe of the following form:
value list_index
2 1
4 1
5 1
6 1
1 2
2 2
3 2
7 3
8 3
I tried to find a solution with the tidyverse but either lost the list_index/name or had problems with the unequal length of the vectors.
You can give names to the list and then use stack in base R.
names(list_test) <- seq_along(list_test)
stack(list_test)
# values ind
#1 2 1
#2 4 1
#3 5 1
#4 6 1
#5 1 2
#6 2 2
#7 3 2
#8 7 3
#9 8 3
If interested in a tidyverse solution we can use enframe with unnest.
tibble::enframe(list_test) %>% tidyr::unnest(value)
Or imap_dfr from purrr.
purrr::imap_dfr(list_test, ~tibble::tibble(value = .x, list_index = .y))
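To get exactly the column names and order requested in the question, the enframe approach can be adjusted slightly. This is a sketch that assumes the list is unnamed, as in the question, so that enframe falls back to the positions 1, 2, 3 as the index.

```r
library(dplyr)
library(tidyr)
library(tibble)

list_test <- list(c(2, 4, 5, 6), c(1, 2, 3), c(7, 8))

# enframe() makes one row per list element; unnest() then expands each
# vector into one row per value, repeating the index as needed.
result <- enframe(list_test, name = "list_index", value = "value") %>%
  unnest(value) %>%
  select(value, list_index)

result
```

If the list has already been given names, as in the stack example, the index column holds those names as character strings instead of integers.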
Another option could be:
map_dfr(list_test, ~ enframe(.) %>%
select(-name), .id = "name")
name value
<chr> <dbl>
1 1 2
2 1 4
3 1 5
4 1 6
5 2 1
6 2 2
7 2 3
8 3 7
9 3 8
Or, if you don't mind also having a column with the within-vector indexes:
map_dfr(list_test, enframe, .id = "name_list")
name_list name value
<chr> <int> <dbl>
1 1 1 2
2 1 2 4
3 1 3 5
4 1 4 6
5 2 1 1
6 2 2 2
7 2 3 3
8 3 1 7
9 3 2 8
In base R, we can use lengths to replicate the sequence and unlist the list elements into a two column 'data.frame'
data.frame(value = unlist(list_test),
list_index = rep(seq_along(list_test), lengths(list_test)))
# value list_index
#1 2 1
#2 4 1
#3 5 1
#4 6 1
#5 1 2
#6 2 2
#7 3 2
#8 7 3
#9 8 3
