Convert a scraped string to an integer in R

I am struggling to find a way to convert a string that has both numbers and letters into just a number in R. I web-scraped data and now want to convert one column from a string into a number. The last column of my df, Clean.data$Drafted..tm.rnd.yr, currently holds values like "Arizona / 1st / 5th pick / 2011". I am trying to extract the pick number, so for that example I would want to extract just "5". Is there any way to do this? I am fairly new to R.
library(rvest)
library(magrittr)
library(dplyr)
library(purrr)

years <- 2010:2020
urls <- paste0(
  'https://www.pro-football-reference.com/draft/',
  years,
  '-combine.htm')

combine.data <- map(
  urls,
  ~ read_html(.x) %>%
    html_nodes(".stats_table") %>%
    html_table() %>%
    as.data.frame()
) %>%
  set_names(years) %>%
  bind_rows(.id = "year") %>%
  filter(Pos == 'CB' | Pos == "S")

# keep only rows with no empty cells
Clean.data <- combine.data[!rowSums(combine.data == "") > 0, ]
This is my code so far.

You can use regex to extract the relevant number from the data.
Clean.data$pick_number <- as.integer(
  sub('.*?/\\s(\\d+).*', '\\1', Clean.data$Drafted..tm.rnd.yr.))
Clean.data$pick_number
# [1] 5 2 5 3 1 1 4 1 5 3 3 4 1 4 3 5 3 2 2 4 3 1 5 1 5 7 2
# [28] 5 3 7 1 2 3 4 7 7 2 3 3 5 3 5 7 3 2 2 5 3 5 4 4 6 1 3
# [55] 6 7 6 4 2 4 3 2 6 5 2 3 5 3 1 2 2 4 3 1 3 6 4 6 2 2 2
# [82] 4 1 6 3 3 4 5 2 1 3 3 7 3 1 2 1 4 4 5 3 1 2 4 3 2 7 3
#[109] 3 4 5 2 4 5 1 7 2 6 5 4 2 6 4 4 5 4
This extracts the digits after the first "/".
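If you specifically want the number that comes right before the word "pick" (5 in your example string), rather than the digits after the first "/", a variation on the same sub() idea should work. This is a sketch and has not been checked against the live scraped data:
# capture the digits immediately before the ordinal suffix and the word "pick"
Clean.data$pick_number <- as.integer(
  sub('.*?(\\d+)(st|nd|rd|th) pick.*', '\\1', Clean.data$Drafted..tm.rnd.yr.))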

Related

How to combine columns with the same ID using R?

I want to combine V1 and V2 by the matching ID number using R. What's the simplest way to go about it?
Below is an example of how I want to combine my data. Hopefully this makes sense; if not, I can try to be clearer. I did try group_by, but I don't know if that's the best way to go about it.
ID V1 V2
1 3 2
2 3 4
3 5 1
3 2 3
4 2 3
4 5 7
4 1 3
This is what I would like it to look like
ID V3
1 3
1 2
2 3
2 4
3 5
3 1
3 2
3 3
4 2
4 3
4 5
4 7
4 1
4 3
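Both answers below refer to a data frame df; a minimal sketch of that setup, built from the table above:
df <- data.frame(ID = c(1, 2, 3, 3, 4, 4, 4),
                 V1 = c(3, 3, 5, 2, 2, 5, 1),
                 V2 = c(2, 4, 1, 3, 3, 7, 3))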
Try using pivot_longer with names_to = NULL to remove the unwanted column.
tidyr::pivot_longer(df, V1:V2, values_to = "V3", names_to = NULL)
Output:
# ID V3
# <int> <int>
# 1 1 3
# 2 1 2
# 3 2 3
# 4 2 4
# 5 3 5
# 6 3 1
# 7 3 2
# 8 3 3
# 9 4 2
# 10 4 3
# 11 4 5
# 12 4 7
# 13 4 1
# 14 4 3
You may try
library(dplyr)
reshape2::melt(df, "ID") %>% select(ID, value) %>% arrange(ID)
ID value
1 1 3
2 1 2
3 2 3
4 2 4
5 3 5
6 3 2
7 3 1
8 3 3
9 4 2
10 4 5
11 4 1
12 4 3
13 4 7
14 4 3
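A base R equivalent that keeps the same row-wise ordering as the pivot_longer output (a sketch using the df above):
# interleave V1 and V2 within each original row
data.frame(ID = rep(df$ID, each = 2), V3 = c(t(df[c("V1", "V2")])))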

Using "contain" function with two arguments in R

I have a dataset f.ex. like this:
dat1 <- read.table(header=TRUE, text="
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")
I'd like to use the select function to gather the variables that contain Trust AND T1.
dat1 <- dat1 %>%
  mutate(Trust_T1 = select(., contains("Trust")))
Does anybody know how to use two arguments there, so that I get Trust AND T1? If I use:
dat1 <- dat1 %>%
  mutate(Trust_T1 = select(., contains("Trust"), contains("T1")))
it gives me the variables that contain EITHER Trust OR T1.
Best!
If we need both, then use matches() with a regex specifying column names that start (^) with 'Trust' and end ($) with 'T1' (assuming these are the only patterns).
library(dplyr)
dat1 %>%
  select(matches("^Trust_.*T1$"))
The mutate() used to create a new column is not clear, as there are multiple columns that match 'Trust' followed by 'T1'. If the intention is to do some operation on the selected columns, it can be done with either across() or c_across() with rowwise() (not clear from the post).
One solution could be:
library(dplyr)
df %>% select(starts_with('Trust') | contains('_T1'))
#> Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2
#> 1 5 1 2 1 5 3
#> 2 3 1 3 3 4 2
#> 3 2 1 3 1 3 1
#> 4 4 2 5 5 3 2
#> 5 5 1 4 1 2 2
#> Cont_01_T1
#> 1 1
#> 2 1
#> 3 2
#> 4 3
#> 5 4
DATA
df <- read.table(text =
"
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
", header =T)

Can I use Boolean operators with R tidy select functions

Is there a way I can use Boolean operators (e.g. | or &) with the tidyselect helper functions to select variables?
The code below illustrates what currently works and what, in my mind, should work but doesn't.
df <- sample(seq(1, 4, 1), replace = T, size = 400)
df <- data.frame(matrix(df, ncol = 10))

# make variable names
library(tidyverse)
library(stringr)
vars1 <- str_c('q1_', seq(1, 5, 1))
vars2 <- str_c('q9_', seq(1, 5, 1))

# Assign
names(df) <- c(vars1, vars2)
names(df)

# This works
df %>%
  select(starts_with('q1_'), starts_with('q9'))

# This does not work using |
df %>%
  select(starts_with('q1_' | 'q9_'))

# This does not work with c()
df %>%
  select(starts_with(c('q1_', 'q9_')))
You can use multiple starts_with, e.g.,
df %>% select(starts_with('q1_'), starts_with('q9_'))
You can use | inside a regular expression together with matches() (here combined with ^, the regex start-of-string anchor):
df %>% select(matches('^q1_|^q9_'))
You can also approach it using purrr:
map(.x = c("q1_", "q9_"), ~ df %>%
      select(starts_with(.x))) %>%
  bind_cols()
q1_1 q1_2 q1_3 q1_4 q1_5 q9_1 q9_2 q9_3 q9_4 q9_5
1 2 4 3 1 2 2 3 1 1 3
2 1 3 3 4 4 3 2 2 1 3
3 2 2 3 4 3 4 1 3 2 4
4 1 2 4 2 4 3 3 1 3 3
5 3 1 2 3 3 2 2 3 3 3
6 4 2 3 4 1 4 2 4 2 4
7 3 1 4 1 4 2 4 4 1 2
8 2 2 3 2 1 3 3 3 1 4
9 1 4 2 3 4 4 1 1 3 4
10 1 1 2 4 1 1 4 4 1 2
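For completeness: with more recent tidyselect (the version bundled with dplyr 1.0.0 and later), starts_with() accepts a character vector and | can be used between helpers, so variants like these may also work depending on your installed version (a sketch):
df %>%
  select(starts_with(c("q1_", "q9_")))

df %>%
  select(starts_with("q1_") | starts_with("q9_"))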

Reshaping different variables for selecting values from one column in R

Below is a sample of my data; I have more R and O columns.
A R1 O1 R2 O2 R3 O3
1 3 3 5 3 6 4
2 3 3 5 4 7 4
3 4 4 5 5 6 5
I want to get the following data
A R O Value
1 3 1 3
1 5 2 3
1 6 3 4
2 3 1 3
2 5 2 4
2 7 3 4
3 4 1 4
3 5 2 5
3 6 3 5
I tried the melt function, but I was unsuccessful. Any help would be very much appreciated.
A solution using dplyr and tidyr. The key is to use gather to collect all the columns other than A, then use extract to split the column names, and then use spread to convert the data frame back to wide format.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  gather(Column, Number, -A) %>%
  extract(Column, into = c("Column", "ID"), regex = "([A-Z]+)([0-9]+)") %>%
  spread(Column, Number) %>%
  select(A, R, O = ID, Value = O)
dt2
# A R O Value
# 1 1 3 1 3
# 2 1 5 2 3
# 3 1 6 3 4
# 4 2 3 1 3
# 5 2 5 2 4
# 6 2 7 3 4
# 7 3 4 1 4
# 8 3 5 2 5
# 9 3 6 3 5
DATA
dt <- read.table(text = "A R1 O1 R2 O2 R3 O3
1 3 3 5 3 6 4
2 3 3 5 4 7 4
3 4 4 5 5 6 5",
header = TRUE)
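gather() and spread() are superseded in current tidyr; the same reshape can be done in one step with pivot_longer() and a names_pattern. A sketch using the dt above (the temporary column name "index" is arbitrary):
library(dplyr)
library(tidyr)

dt %>%
  pivot_longer(-A,
               names_to = c(".value", "index"),    # ".value" keeps R and O as columns
               names_pattern = "([A-Z]+)([0-9]+)") %>%  # splits e.g. "R1" into "R" and "1"
  rename(Value = O) %>%   # the measured O value becomes Value
  rename(O = index) %>%   # the 1/2/3 index becomes the new O column
  select(A, R, O, Value)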

Assign value to group based on condition in column

I have a data frame that looks like the following:
> df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
                 date = c(1,2,3,4,5,6,7,8,9),
                 value = c(3,4,3,4,5,6,6,4,9))
> df
group date value
1 1 1 3
2 1 2 4
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 6
8 3 8 4
9 3 9 9
I want to create a new column that contains the date value per group that is associated with the value "4" from the value column.
The following data frame shows what I hope to accomplish.
group date value newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
As we can see, group 1 has the newValue "2" because that is the date associated with the value "4". Similarly, group two has newValue 4 and group three has newValue 8.
I assume there is an easy way to do this using ave() or a range of dplyr/data.table functions, but I have been unsuccessful with my many attempts.
Here's a quick data.table one
library(data.table)
setDT(df)[, newValue := date[value == 4L], by = group]
df
# group date value newValue
# 1: 1 1 3 2
# 2: 1 2 4 2
# 3: 1 3 3 2
# 4: 2 4 4 4
# 5: 2 5 5 4
# 6: 2 6 6 4
# 7: 3 7 6 8
# 8: 3 8 4 8
# 9: 3 9 9 8
Here's a similar dplyr version
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(newValue = date[value == 4L])
Or a possible base R solution using merge after filtering the data (will need some renaming afterwards)
merge(df, df[df$value == 4, c("group", "date")], by = "group")
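The renaming could be folded into the merge() call, for example (a sketch):
merge(df,
      setNames(df[df$value == 4, c("group", "date")], c("group", "newValue")),
      by = "group")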
Here is a base R option
df$newValue = rep(df$date[which(df$value == 4)], table(df$group))
Another alternative using lapply
do.call(rbind, lapply(split(df, df$group),
                      function(x){x$newValue = rep(x$date[which(x$value == 4)],
                                                   each = length(x$group)); x}))
# group date value newValue
#1.1 1 1 3 2
#1.2 1 2 4 2
#1.3 1 3 3 2
#2.4 2 4 4 4
#2.5 2 5 5 4
#2.6 2 6 6 4
#3.7 3 7 6 8
#3.8 3 8 4 8
#3.9 3 9 9 8
One more base R path:
df$newValue <- ave(`names<-`(df$value == 4, df$date), df$group,
                   FUN = function(x) as.numeric(names(x)[x]))
df
group date value newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
10 3 11 7 8
I tested this on variable-length groups. The date column is assigned as the names of the logical vector value == 4; within each group, the name (the date) at the TRUE position is then returned as a number.
Data
df = data.frame(group = c(1,1,1,2,2,2,3,3,3,3),
                date = c(1,2,3,4,5,6,7,8,9,11),
                value = c(3,4,3,4,5,6,6,4,9,7))
