Issue creating statcast database with BaseballR Package - r

I am trying to create a database of all MLB statcast outcomes. For this, I am using the baseballr package made by Bill Petti https://billpetti.github.io/2020-05-26-build-statcast-database-rstats-version-2.0/. I am not connecting to a SQL database but simply making a data frame in R. I want to collect all statcast data from 2019 and 2020. First, I loaded in the necessary packages.
library(baseballr)
library(tidyverse)
Then I executed the annual_statcast_query function:
annual_statcast_query <- function(season) {
dates <- seq.Date(as.Date(paste0(season, '-03-01')),
as.Date(paste0(season, '-12-01')), by = 'week')
date_grid <- tibble(start_date = dates,
end_date = dates + 6)
safe_savant <- safely(scrape_statcast_savant)
payload <- map(.x = seq_along(date_grid$start_date),
~{message(paste0('\nScraping week of ', date_grid$start_date[.x], '...\n'))
payload <- safe_savant(start_date = date_grid$start_date[.x],
end_date = date_grid$end_date[.x], type = 'pitcher')
return(payload)
})
payload_df <- map(payload, 'result')
number_rows <- map_df(.x = seq_along(payload_df),
~{number_rows <- tibble(week = .x,
number_rows = length(payload_df[[.x]]$game_date))}) %>%
filter(number_rows > 0) %>%
pull(week)
payload_df_reduced <- payload_df[number_rows]
combined <- payload_df_reduced %>%
bind_rows()
return(combined)
}
When I ran his code for the 2019 season payload <- annual_statcast_query(2019), I could scrape the data without any problems. However, when I tried it for 2020 payload <- annual_statcast_query(2020) I encountered the error:
Error: Can't combine `spin_rate_deprecated` <logical> and `spin_rate_deprecated` <character>.
This error occurs in the last part of the annual_statcast_query function:
combined <- payload_df_reduced %>%
bind_rows()
When reading through the statcast documentation (https://baseballsavant.mlb.com/csv-docs), it appears that the variable spin_rate_depreceated was replaced by release_spin. Perhaps this is why I am encountering this error. I do not need this variable for my analysis, and the error tracing I did made it very obvious that fixing the problem is beyond my skill set as a college student.
> rlang::last_error()
<error/vctrs_error_incompatible_type>
Can't combine `spin_rate_deprecated` <logical> and `spin_rate_deprecated` <character>.
Backtrace:
1. global::annual_statcast_query(2020)
3. dplyr::bind_rows(.)
4. vctrs::vec_rbind(!!!dots, .names_to = .id)
6. vctrs::vec_default_ptype2(...)
7. vctrs:::vec_ptype2_df_fallback(x, y, opts)
8. vctrs:::vec_ptype2_params(...)
9. vctrs:::vec_ptype2_opts(x, y, opts = opts, x_arg = x_arg, y_arg = y_arg)
11. vctrs::vec_default_ptype2(...)
12. vctrs::stop_incompatible_type(...)
13. vctrs:::stop_incompatible(...)
14. vctrs:::stop_vctrs(...)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/vctrs_error_incompatible_type>
Can't combine `spin_rate_deprecated` <logical> and `spin_rate_deprecated` <character>.
Backtrace:
x
1. +-global::annual_statcast_query(2020)
2. | \-payload_df_reduced %>% bind_rows()
3. \-dplyr::bind_rows(.)
4. \-vctrs::vec_rbind(!!!dots, .names_to = .id)
5. \-(function () ...
6. \-vctrs::vec_default_ptype2(...)
7. \-vctrs:::vec_ptype2_df_fallback(x, y, opts)
8. \-vctrs:::vec_ptype2_params(...)
9. \-vctrs:::vec_ptype2_opts(x, y, opts = opts, x_arg = x_arg, y_arg = y_arg)
10. \-(function () ...
11. \-vctrs::vec_default_ptype2(...)
12. \-vctrs::stop_incompatible_type(...)
13. \-vctrs:::stop_incompatible(...)
14. \-vctrs:::stop_vctrs(...)
Therefore, I tried to drop this variable from my database before the bind rows operation to avoid the error.
combined <- payload_df_reduced %>%
payload_df_reduced[ , !names(payload_df_reduced) %in% c("spin_rate_deprecated")] %>%
bind_rows()
However, this returned the error message:
Error in .[payload_df_reduced, , !names(payload_df_reduced) %in% c("spin_rate_deprecated")] :
incorrect number of dimensions
I am running
packageVersion("baseballr") [1] ‘0.8.3’
On R 4.03
If anyone could help me find a way to do this, that would be amazing. I am not picky about how I get this data, so I am all ears if anyone has an idea. Thank you so much!

To drop a column from data.frame you should do this:
payload_df_reduced %>%
select(-c(spin_rate_deprecated))
or if using your current way it should be like this
payload_df_reduced[ , !names(payload_df_reduced) %in% c("spin_rate_deprecated")]
Your current code is not work because it is incorrect grammar.
It seem that your payload_df_reduced is a list of data.frame not one data.frame. I tried to run your code but it seem you have other functions so not reproducible. Here is a theory code that you may need to adjust a bit.
combined <- map(payload_df_reduced, select, -c(spin_rate_deprecated)) %>%
bind_rows()

Related

What is generating the error 'Can't subset `.data` outside of a data mask context' with 'dplyr'?

I have a huge shiny app which uses a huge package. I'm not the author of any of them and I'm a bit lost. A function (fermentationPlot) throws the error: Can't subset .data outside of a data mask context:
Warning: Error in fermentationPlot: Can't subset `.data` outside of a data mask context.
185: <Anonymous>
173: dplyr::arrange
172: dplyr::mutate
171: as.data.frame
What could be the cause of this error? What does it mean? Below is the code block which generates it. I googled this error message and I found that it can be fixed by downgrading 'dplyr'. I tried 1.0.10, 1.0.5 and 1.0.0, and the error always occurs.
plotInfo <- dplyr::left_join(
x = dplyr::select(
plotDefaults, -c(.data$templateName, .data$minValue, .data$maxValue)
),
y = plotSettings,
by = .data$dataName
) %>%
dplyr::arrange(!is.na(.data$order), -.data$order) %>%
dplyr::mutate(
color = replace(.data$color, .data$color == "Blue", "Dark blue"),
minValue = as.numeric(.data$minValue),
maxValue = as.numeric(.data$maxValue)
) %>%
as.data.frame()
The by argument of left_join must be a character vector of column names. Probably the author wanted to do
by = "dataName"
and not
by = .data$dataName

Column name being duplicated in recipe

This is the piece of code i'm having troubles with:
pump_recipe <- recipe(status_group ~ ., data = data) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_knn(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
prepared_rec <- prep(pump_recipe)
The error:
Error:
! Column name `funder_W.D...I.` must not be duplicated.
Use .name_repair to specify repair.
Caused by error in `stop_vctrs()`:
! Names must be unique.
x These names are duplicated:
* "funder_W.D...I." at locations 1807 and 1808.
Backtrace:
1. recipes::prep(pump_recipe)
2. recipes:::prep.recipe(pump_recipe)
4. recipes:::bake.step_dummy(x$steps[[i]], new_data = training)
8. tibble:::as_tibble.data.frame(indicators)
9. tibble:::lst_to_tibble(unclass(x), .rows, .name_repair)
...
16. vctrs `<fn>`()
17. vctrs:::validate_unique(names = names, arg = arg)
18. vctrs:::stop_names_must_be_unique(names, arg)
19. vctrs:::stop_names(...)
20. vctrs:::stop_vctrs(class = c(class, "vctrs_error_names"), ...)
Error:
Caused by error in `stop_vctrs()`:
! Names must be unique.
x These names are duplicated:
* "funder_W.D...I." at locations 1807 and 1808.
So basically it seems like the step_dummy step is doing something strange, and creating a duplicated column here. I don't know why this is happening. This is the data I'm working with:
https://github.com/norhther/datasets/blob/main/data.csv
You are having levels in funder and installer that are so similar that step_dummy() creates labels of the same name. The error says that funder_W.D...I. appears twice.
If we do some filtering on the funder column we see that there are 3 different names that match.
str_subset(data$funder, "W.D") |> unique()
[1] "W.D.&.I." "W.D & I." "W.D &"
Neither "W.D.&.I." or "W.D & I." are valid names so step_dummy() tries to fix them. This yields "funder_W.D...I." for both.
You can fix this by using textrecipes::step_clean_levels(), this make sure that the levels of these variables stay valid and non-overlapping.
library(recipes)
pump_recipe <- recipe(status_group ~ ., data = data) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_knn(all_nominal_predictors()) %>%
textrecipes::step_clean_levels(funder, installer) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
prepared_rec <- prep(pump_recipe)
Note: As you say, I would imagine that "W.D.&.I.", "W.D & I." and "W.D &" all refer to the same entity. You should take a look to see if you can collapse these levels manually.

What is the error `vec_assign()`: `value` should have been recycled to fit `x`. trying to tell me

I have a function that does some basic web harvesting. This function is called after a successful login. (website has been masked xxxxxx)
Search Function:
search <-function(HorseList){
url <- "http://tnetwork.xxxxxx.com/tnet/HorseSearch.aspx"
s <- GET(url)
xxxxxx <- tibble(
horse_name = character(),
race_date = character(),
race_nbr = character(),
trk = character(),
peak = character(),
dist_run = character()
)
for (row in 1:nrow(HorseList))
{
url <-paste(c('http://tnetwork.xxxxxx.com/tnet/HorseSearchAPI.aspx?HorseName=',toString(HorseList[[row, 1]])),collapse='')
#print(url)
h <- GET(url)
temp<-content(h, "text")
doc <- htmlParse(temp)
horse_name <- HorseList[[row,1]]
horse_ID <-xpathSApply(doc,"//*[#id=\"resultsDiv\"]/p[1]/a/#href")
horse_ID <-substr(horse_ID,27,40)
h_list <- list()
c <- nchar(horse_ID)
if (length(c)>0)
{
h_list[1] <- horse_ID
}
id_count <- length(h_list)
for (k in 1:id_count)
{
url <-paste(c('http://tnetwork.xxxxxx.com/tnet/t_PastPerf.aspx?HorseID=',toString(h_list[k])),collapse='')
t <- GET(url)
temp <- content(t, "text")
pastperf <- htmlParse(temp)
row_count<-length(xpathSApply(pastperf,"//*[#id=\"pastPerfTable\"]/tr"))
for(j in 2:row_count)
{
j<- toString(j)
race_data <- xpathSApply(pastperf,paste("//*[#id=\"pastPerfTable\"]/tr[",j,"]/td[1][1]"),xmlValue)
race_date <- substr(race_data,1,10)
race_number <-trimws(substr(race_data,12,100))
horse_name <- URLdecode(toString(horse_name))
race_nbr = str_match(race_number,'(Race\\s\\d+)(.*)')[,2]
trk = str_match(race_number,'(Race\\s\\d+)(.*)')[,3]
peak <-xpathSApply(pastperf,paste("//*[#id=\"pastPerfTable\"]/tr[",j,"]/td[13]"),xmlValue)
cum_distance <-xpathSApply(pastperf,paste("//*[#id=\"pastPerfTable\"]/tr[",j,"]/td[14]"),xmlValue)
newrow <- paste(horse_name,',',race_date,',',race_nbr,',',trk, ',',peak,',',cum_distance)
xxxxxx <- add_row(trakus, horse_name = horse_name, race_date = race_date, race_nbr = race_nbr, trk=trk, peak = peak, dist_run = cum_distance)
}
}
}
return(xxxxxx)
}
The function has worked successfully in the past, but today it is throwing the following error:
Error: Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
I ran the rlang::last_error() and last_trace() commands to gain some additional insight, but I'm still not sure what's going on.
> rlang::last_error()
<error/rlang_error>
Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
Backtrace:
1. base::source("~/TimeForm/Scripts/past_perf.R", echo = TRUE)
6. global::search(horse_list) ~/TimeForm/Scripts/past_perf.R:627:2
7. tibble::add_row(...) ~/TimeForm/Scripts/past_perf.R:85:8
8. tibble:::rbind_at(.data, df, pos)
9. vctrs::vec_rbind(old, new)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
Backtrace:
x
1. +-base::source("~/TimeForm/Scripts/past_perf.R", echo = TRUE)
2. +-base::source("~/TimeForm/Scripts/past_perf.R", echo = TRUE)
3. | +-base::withVisible(eval(ei, envir))
4. | \-base::eval(ei, envir)
5. | \-base::eval(ei, envir)
6. \-global::search(horse_list) ~/TimeForm/Scripts/past_perf.R:627:2
7. \-tibble::add_row(...) ~/TimeForm/Scripts/past_perf.R:85:8
8. \-tibble:::rbind_at(.data, df, pos)
9. \-vctrs::vec_rbind(old, new)
10. \-(function () ...
It appears the add_row() line in my code may be the culprit, but I'm not sure what the error is telling or how to fix it. Does anyone have any insights they could share?
I found that the problem occurs in the handling of empty character fields, null values or the use of NA. The problem was corrected with a map_if replacement of null values.
map_depth(.depth = 1, map_if, is_empty, ~paste0(""))
Of course, you'll need to adjust the depth command to make the correction at the appropriate level for your list construct.
Ideally, bind_rows() would handle vectors containing NA or NULL values more robustly.

Error in n() inside Summarise Function dplyr

everything good?
During that week I spent time writing a script that even this morning seemed to work. but then I tried to run it again and exactly in a part that uses the function "summarize" of the package dplyr appears an error that I had never seen.
Below is an excerpt of the code I used and the error on the console:
library(tidyverse)
a <- c(1,0,1,1,0,1,1,1,1,0,0)
b <-c( 0.9157101,
0.4854955,
0.8853174,
0.4373646,
0.3855175,
0.8603407,
0.9193342,
0.4693117,
0.9849855,
0.4458159,
0.4379776)
c <- c(8,2,7,1,0,6,8,1,9,1,1)
treated_data <- data.frame(Risk = a ,
Model_Predicted = b,
Grupo = c)
calculo <- treated_data %>% group_by(Grupo) %>% summarise(Quantidade = n(),
Non_event = sum(Risk),
Event = n() - sum(Risk))
Console Result:
---------------------------------------------------------
Error in n() : argument "vec" is missing, with no default
---------------------------------------------------------

How to extract columns from a row and save the output as a variable dplyr

I am trying to extract a specific column from a specific row on my excel sheet (df). However, when I try to do so I get the message:
Error: ... must evaluate to column positions or names, not a list
Call `rlang::last_error()` to see a backtrace.
When I call rlang::last_error() I get:
Backtrace:
1. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin)
9. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
10. tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n not { first_type }")
11. tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
12. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin)
At this point, I am lost. What can I do to my code to work?
library(readxl)
Lakers_Overall_Stats <- read_excel("Desktop/Lakers Overall Stats.xlsx")
library(readxl)
Lakers_Record <- read_excel("Desktop/Lakers Record.xlsx")
require(dplyr)
require(ggplot2)
##WinPercentage of the Team after season
mydata <- Lakers_Record %>% select(Pts,Opp,W,L)%>%
+ mutate(wpct=Pts^13.91/(Pts^13.91+Opp^13.91),expwin=round(wpct*(W+L)),diff=W-expwin)
head(mydata)
##Specifiying
Lakers_Overall_Stats[23,6] <- TmMin
Lakers_Overall_Stats[23,8] <- TmFga
Lakers_Overall_Stats[23,18] <- TmFta
Lakers_Overall_Stats[23,26] <- TmTov
rlang::last_error()
##Usage Percentage
Usgpct <- Lakers_Overall_Stats %>% select(FGA,FTA,TOV,MP,TmFga,TmFta,TmTov,TmMin)%>%
+ mutate(100*(Fga+0.44*Fta+Tov))*TmMin/(TmFga+0.44*TmFta+TmTov)*5(MP)
##head(Usgpct)
##filter(rank(desc(Usgpct))==1)
Also, am I filtering correctly? or should it be written as
Usgpct <- Lakers_Overall_Stats %>% select(FGA,FTA,TOV,MP,TmFga,TmFta,TmTov,TmMin)%>%
filter(rank(desc(Usgpct))==1)%>%
mutate(100*(Fga+0.44*Fta+Tov))*TmMin/(TmFga+0.44*TmFta+TmTov)*5(MP)
head(Usgpct)
You have
Lakers_Overall_Stats[23,6] <- TmMin
This will modify the Lakers_Overall_Stats data frame by setting the element at 23,6 etc. to be TmMin. TmMin is an object outside of your data frame.
Maybe you want:
TmMin <- Lakers_Overall_Stats[23,6]
?
Also, you cannot select TmFga,TmFta,TmTov,TmMin since these variables are not part of your data frame. You can refer to those variables in your mutate equation, but because of the way you've set it up, they're stand-alone variables.

Resources