Using the "contains" function with two arguments in R

I have a dataset, for example:
dat1 <- read.table(header=TRUE, text="
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")
I'd like to use the select function to gather the variables that contain Trust AND T1.
dat1 <- dat1 %>%
mutate(Trust_T1 = select(., contains("Trust")))
Does anybody know how to pass two arguments there so that I get only the variables that contain both Trust AND T1? If I use:
dat1 <- dat1 %>%
mutate(Trust_T1 = select(., contains("Trust"), contains("T1")))
it gives me the variables that contain EITHER Trust or T1.
best!

If we need both, use matches with a regex that specifies column names that start (^) with 'Trust' and end ($) with 'T1' (assuming these are the only patterns):
library(dplyr)
dat1 %>%
select(matches("^Trust_.*T1$"))
It is not clear why mutate is used to create a new column, since multiple columns match 'Trust' followed by 'T1'. If the intention is to perform some operation on the selected columns, either across or c_across with rowwise can be used (the post does not say which is wanted).
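If the goal is a row-wise summary over the selected columns, a minimal sketch could look like the following. It assumes the intended operation is a per-row mean, which the question does not state, and the new column name Trust_T1_mean is only illustrative.

```r
library(dplyr)

dat1 <- read.table(header = TRUE, text = "
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")

# Assumption: a per-row mean is wanted; Trust_T1_mean is a hypothetical name.
# Vectorised variant: across() returns the selected columns as a data frame.
res_across <- dat1 %>%
  mutate(Trust_T1_mean = rowMeans(across(matches("^Trust_.*T1$"))))

# Row-wise variant: c_across() returns the selected values of the current row.
res_rowwise <- dat1 %>%
  rowwise() %>%
  mutate(Trust_T1_mean = mean(c_across(matches("^Trust_.*T1$")))) %>%
  ungroup()
```

Both variants should give the same new column. The across version avoids rowwise and is usually faster on larger data.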

One solution could be to combine the selection helpers with & (intersection). Joining them with | would instead return the union, i.e. every column that matches either helper:
library(dplyr)
df %>% select(starts_with('Trust') & contains('_T1'))
#>   Trust_01_T1 Trust_02_T1 Trust_03_T1
#> 1           5           1           2
#> 2           3           1           3
#> 3           2           1           3
#> 4           4           2           5
#> 5           5           1           4
DATA
df <- read.table(text =
"
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
", header =T)

Related

Changing specific element in various variables

I've got variables in a dataset like this:
dat1 <- read.table(header=TRUE, text="
comp_T1_01 comp_T1_02 comp_T1_03 res_T1_01 res_T1_02 res_T1_03 res_T1_04
1 1 2 1 5 5 5
2 1 3 3 4 4 1
3 1 3 1 3 2 2
4 2 5 5 3 2 2
5 1 4 1 2 1 3
")
I would like to remove the "T1" from all the variable names at once. As I have over 100 variables, renaming them one by one with "colnames" would be too cumbersome.
Is there a command that can do that?
Thank you!
You can use sub:
names(dat1) <- sub('_T1', '', names(dat1))
dat1
# comp_01 comp_02 comp_03 res_01 res_02 res_03 res_04
#1 1 1 2 1 5 5 5
#2 2 1 3 3 4 4 1
#3 3 1 3 1 3 2 2
#4 4 2 5 5 3 2 2
#5 5 1 4 1 2 1 3
In dplyr, you can use rename_with:
library(dplyr)
dat1 %>% rename_with(~sub('_T1', '', .))
We can use str_remove:
library(dplyr)
library(stringr)
dat1 %>%
rename_all(~ str_remove(., '_T1'))
This may also work:
names(dat1) <- gsub(x = names(dat1), pattern = "\\_T1", replacement = "")
dat1
comp_01 comp_02 comp_03 res_01 res_02 res_03 res_04
1 1 1 2 1 5 5 5
2 2 1 3 3 4 4 1
3 3 1 3 1 3 2 2
4 4 2 5 5 3 2 2
5 5 1 4 1 2 1 3
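The answers above mix sub and gsub. For the names in the question they behave the same, because "_T1" occurs once per name. A small sketch of where they differ, using a made-up name (res_T1_T1_02 is hypothetical and not in the question's data) that contains the pattern twice:

```r
# Hypothetical column names; the second one contains "_T1" twice.
nms <- c("comp_T1_01", "res_T1_T1_02")

# sub() removes only the first occurrence in each string.
first_only <- sub("_T1", "", nms, fixed = TRUE)

# gsub() removes every occurrence in each string.
all_matches <- gsub("_T1", "", nms, fixed = TRUE)

print(first_only)   # "comp_01"  "res_T1_02"
print(all_matches)  # "comp_01"  "res_02"
```

fixed = TRUE treats the pattern as a literal string, so no regex escaping is needed.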

Convert an integer from a scraped string in a database in r

I am struggling to find a way to convert a string that contains both numbers and letters into just a number in R. I web-scraped some data and now want to convert one column from a string into a number. The last column of my data frame, Clean.data$Drafted..tm.rnd.yr., currently reads like "Arizona / 1st / 5th pick / 2011". I am trying to extract the pick number, so for that example I would want to extract just "5". Is there any way to do this? I am fairly new to R.
library(rvest)
library(magrittr)
library(dplyr)
library(purrr)
years <- 2010:2020
urls <- paste0(
'https://www.pro-football-reference.com/draft/',
years,
'-combine.htm')
combine.data <- map(
urls,
~read_html(.x) %>%
html_nodes(".stats_table") %>%
html_table() %>%
as.data.frame()
) %>%
set_names(years) %>%
bind_rows(.id = "year") %>%
filter(Pos == 'CB' | Pos == "S")
Clean.data <- combine.data[!rowSums(combine.data == "")> 0,]
This is my code so far.
You can use regex to extract the relevant number from the data.
Clean.data$pick_number <- as.integer(sub('.*?/\\s(\\d+).*', '\\1',
Clean.data$Drafted..tm.rnd.yr.))
Clean.data$pick_number
# [1] 5 2 5 3 1 1 4 1 5 3 3 4 1 4 3 5 3 2 2 4 3 1 5 1 5 7 2
# [28] 5 3 7 1 2 3 4 7 7 2 3 3 5 3 5 7 3 2 2 5 3 5 4 4 6 1 3
# [55] 6 7 6 4 2 4 3 2 6 5 2 3 5 3 1 2 2 4 3 1 3 6 4 6 2 2 2
# [82] 4 1 6 3 3 4 5 2 1 3 3 7 3 1 2 1 4 4 5 3 1 2 4 3 2 7 3
#[109] 3 4 5 2 4 5 1 7 2 6 5 4 2 6 4 4 5 4
This extracts the digits after the first "/".
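For the sample string in the question, the first "/" is followed by the round ("1st"), while the pick ("5th") comes just before the word "pick". A hedged sketch comparing the two patterns on that one string is below. The second pattern is an assumption about the column format (digits, an ordinal suffix, then "pick") and has not been checked against the scraped data.

```r
# The sample value quoted in the question.
x <- "Arizona / 1st / 5th pick / 2011"

# Pattern from the answer above: digits after the first "/" (here, the round).
after_first_slash <- as.integer(sub(".*?/\\s(\\d+).*", "\\1", x, perl = TRUE))

# Assumed alternative: digits directly before the word "pick".
pick_number <- as.integer(
  sub(".*/\\s*(\\d+)[a-z]*\\s+pick.*", "\\1", x, perl = TRUE)
)

print(after_first_slash)  # 1
print(pick_number)        # 5
```

If a value does not match the pattern, sub returns the input unchanged, and as.integer then yields NA with a warning. Undrafted players would need to be handled separately.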

Using loops with mutate in R to sum columns with partially matching column names

df <- data.frame(x_1_jr=c(1,2,3,4), x_2_jr=c(1,2,3,4), y_1_jr=c(4,3,2,1), y_2_jr=c(4,3,2,1))
x_1_jr x_2_jr y_1_jr y_2_jr
1 1 1 4 4
2 2 2 3 3
3 3 3 2 2
4 4 4 1 1
I am trying to generate new variables that are the sum of x and y with the same column name suffix, i.e.
df <- df %>% mutate(z_1_jr= x_1_jr + y_1_jr)
x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr
1 1 1 4 4 5
2 2 2 3 3 5
3 3 3 2 2 5
4 4 4 1 1 5
I could write this out for each variable combination, but I have a large number of variables (>50 for each x and y group) and would like to use a loop. However, I'm relatively new to R and am not sure where to begin!
Can someone help? Thank you!
EDIT: for additional clarity, the dataset contains other non-numeric variables. There are >700 columns (from a large survey). x_1_jr represents, for example, the number of male individuals ages 1 year, y_1_jr female individuals of 1 year. I am trying to get a total (male plus female of 1 year) for each age group.
An option with base R:
df[c("z_1_jr", "z_2_jr")] <- sapply(split.default(df,
sub("^[a-z]+_", "", names(df))), rowSums)
df
# x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr z_2_jr
#1 1 1 4 4 5 5
#2 2 2 3 3 5 5
#3 3 3 2 2 5 5
#4 4 4 1 1 5 5
One dplyr and purrr option could be:
library(dplyr)
library(purrr)
df %>%
bind_cols(map_dfc(.x = unique(sub(".*?_", "_", names(df))),
~ df %>%
transmute(!!paste0("z", .x) := rowSums(select(., ends_with(.x))))))
x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr z_2_jr
1 1 1 4 4 5 5
2 2 2 3 3 5 5
3 3 3 2 2 5 5
4 4 4 1 1 5 5
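Since the question asks for a loop and mentions other non-numeric columns, a plain for loop may be the easiest place to start. The sketch below assumes every x_ column has a y_ partner with the same suffix. The id column is a made-up stand-in for the unrelated survey variables.

```r
df <- data.frame(
  x_1_jr = c(1, 2, 3, 4), x_2_jr = c(1, 2, 3, 4),
  y_1_jr = c(4, 3, 2, 1), y_2_jr = c(4, 3, 2, 1),
  id = letters[1:4]  # hypothetical non-numeric column, left untouched
)

# Collect the suffixes ("1_jr", "2_jr", ...) from the x_ columns only.
suffixes <- sub("^x_", "", grep("^x_", names(df), value = TRUE))

# For each suffix, add the matching x_ and y_ columns into a new z_ column.
for (s in suffixes) {
  df[[paste0("z_", s)]] <- df[[paste0("x_", s)]] + df[[paste0("y_", s)]]
}

print(names(df))
```

Because the loop only touches columns whose names start with x_ or y_, the other columns in the data frame are left unchanged.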

Can I use Boolean operators with R tidy select functions

Is there a way I can use Boolean operators (e.g. | or &) with the tidyselect helper functions to select variables?
The code below illustrates what currently works and what, in my mind, should work but doesn't.
df<-sample(seq(1,4,1), replace=T, size=400)
df<-data.frame(matrix(df, ncol=10))
#make variable names
library(tidyverse)
library(stringr)
vars1<-str_c('q1_', seq(1,5,1))
vars2<-str_c('q9_', seq(1,5,1))
#Assign
names(df)<-c(vars1, vars2)
names(df)
#This works
df %>%
select(starts_with('q1_'), starts_with('q9'))
#This does not work using |
df %>%
select(starts_with('q1_'| 'q9_'))
#This does not work with c()
df %>%
select(starts_with(c('q1_', 'q9_')))
You can use multiple starts_with, e.g.,
df %>% select(starts_with('q1_'), starts_with('q9_'))
You can use | in a regular expression and matches() (in this case, in combination with ^, the regex beginning-of-string)
df %>% select(matches('^q1_|^q9_'))
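In more recent versions of tidyselect (assumed here: 1.0.0 or later, as shipped with current dplyr), Boolean operators can be placed between the helpers rather than inside them, and the helpers accept a character vector of patterns. A minimal sketch under that version assumption, with seeded data standing in for the question's random sample:

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(matrix(sample(1:4, 100, replace = TRUE), ncol = 10))
names(df) <- c(paste0("q1_", 1:5), paste0("q9_", 1:5))

# Union: | goes between two helper calls, not between two strings.
by_or <- df %>% select(starts_with("q1_") | starts_with("q9_"))

# A character vector of prefixes is treated as a union of matches.
by_vec <- df %>% select(starts_with(c("q1_", "q9_")))

# Intersection: columns that start with "q" AND end with "_1".
both <- df %>% select(starts_with("q") & ends_with("_1"))

print(names(both))  # "q1_1" "q9_1"
```

The original attempt starts_with('q1_' | 'q9_') fails regardless of version, because | is then applied to two character strings before starts_with ever sees them.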
You can also approach it using purrr:
map(.x = c("q1_", "q9_"), ~ df %>%
select(starts_with(.x))) %>%
bind_cols()
q1_1 q1_2 q1_3 q1_4 q1_5 q9_1 q9_2 q9_3 q9_4 q9_5
1 2 4 3 1 2 2 3 1 1 3
2 1 3 3 4 4 3 2 2 1 3
3 2 2 3 4 3 4 1 3 2 4
4 1 2 4 2 4 3 3 1 3 3
5 3 1 2 3 3 2 2 3 3 3
6 4 2 3 4 1 4 2 4 2 4
7 3 1 4 1 4 2 4 4 1 2
8 2 2 3 2 1 3 3 3 1 4
9 1 4 2 3 4 4 1 1 3 4
10 1 1 2 4 1 1 4 4 1 2

R: Assign incremental ids based on the groups [duplicate]

I have the following sample data frame:
> test = data.frame(UserId = sample(1:5, 10, replace = T)) %>% arrange(UserId)
> test
UserId
1 1
2 1
3 1
4 1
5 1
6 3
7 4
8 4
9 4
10 5
I now want another column called loginCount that numbers the rows within each user, restarting at 1 for every UserId. Using mutate as below assigns one id per group, but how do I get incremental ids within each group, independent of the other groups?
> test %>% mutate(loginCount = group_indices_(test, .dots = "UserId"))
UserId loginCount
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 3 2
7 4 3
8 4 3
9 4 3
10 5 4
I want something like shown below:
UserId loginCount
1 1
1 2
1 3
1 4
1 5
3 1
4 1
4 2
4 3
5 1
You could group and use row_number:
test %>%
arrange(UserId) %>%
group_by(UserId) %>%
mutate(loginCount = row_number()) %>%
ungroup()
# A tibble: 10 x 2
UserId loginCount
<int> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 3 1
7 4 1
8 4 2
9 4 3
10 5 1
One solution using base R tapply():
test$loginCount <- unlist(tapply(rep(1, nrow(test)), test$UserId, cumsum))
> test
UserId loginCount
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 3 1
7 4 1
8 4 2
9 4 3
10 5 1
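The tapply approach relies on test already being sorted by UserId, because unlist returns the counts group by group. A base R sketch that also works on unsorted data uses ave. The small unsorted data frame here is made up for illustration:

```r
# Hypothetical unsorted input, to show that row order does not matter.
test <- data.frame(UserId = c(4, 1, 4, 1, 5))

# ave() applies seq_along within each UserId group and writes the result
# back to the original row positions.
test$loginCount <- ave(seq_along(test$UserId), test$UserId, FUN = seq_along)

print(test)
```

For this input the counts come out as 1, 1, 2, 2, 1, in the original row order.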
