Column name being duplicated in recipe - r

This is the piece of code I'm having trouble with:
pump_recipe <- recipe(status_group ~ ., data = data) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_knn(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_numeric_predictors())
prepared_rec <- prep(pump_recipe)
The error:
Error:
! Column name `funder_W.D...I.` must not be duplicated.
Use .name_repair to specify repair.
Caused by error in `stop_vctrs()`:
! Names must be unique.
x These names are duplicated:
* "funder_W.D...I." at locations 1807 and 1808.
Backtrace:
1. recipes::prep(pump_recipe)
2. recipes:::prep.recipe(pump_recipe)
4. recipes:::bake.step_dummy(x$steps[[i]], new_data = training)
8. tibble:::as_tibble.data.frame(indicators)
9. tibble:::lst_to_tibble(unclass(x), .rows, .name_repair)
...
16. vctrs `<fn>`()
17. vctrs:::validate_unique(names = names, arg = arg)
18. vctrs:::stop_names_must_be_unique(names, arg)
19. vctrs:::stop_names(...)
20. vctrs:::stop_vctrs(class = c(class, "vctrs_error_names"), ...)
Error:
Caused by error in `stop_vctrs()`:
! Names must be unique.
x These names are duplicated:
* "funder_W.D...I." at locations 1807 and 1808.
So basically it seems like the step_dummy step is doing something strange, and creating a duplicated column here. I don't know why this is happening. This is the data I'm working with:
https://github.com/norhther/datasets/blob/main/data.csv

You have levels in funder and installer that are so similar that step_dummy() generates the same column name for them. The error says that funder_W.D...I. appears twice.
If we filter the funder column, we see that there are 3 different names that match.
str_subset(data$funder, "W.D") |> unique()
[1] "W.D.&.I." "W.D & I." "W.D &"
Neither "W.D.&.I." nor "W.D & I." is a valid name, so step_dummy() tries to fix them. This yields "funder_W.D...I." for both.
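The collision can be reproduced without recipes. This is a minimal sketch that assumes step_dummy() cleans level names in a way comparable to base R's make.names(); the result matches the column name in the error message, but the exact internals are an assumption here.

```r
# Levels taken from the str_subset() output above.
lvls <- c("W.D.&.I.", "W.D & I.", "W.D &")

# make.names() replaces every character that is invalid in an R name with ".".
cleaned <- make.names(lvls)
dummy_cols <- paste0("funder_", cleaned)

print(dummy_cols)
# TRUE wherever a name repeats an earlier one.
print(duplicated(dummy_cols))
```

The first two levels differ only in characters that are not allowed in R names, so they end up identical after cleaning.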
You can fix this by using textrecipes::step_clean_levels(), which makes sure that the levels of these variables stay valid and non-overlapping.
library(recipes)
pump_recipe <- recipe(status_group ~ ., data = data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_knn(all_nominal_predictors()) %>%
  textrecipes::step_clean_levels(funder, installer) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
prepared_rec <- prep(pump_recipe)
Note: I would imagine that "W.D.&.I.", "W.D & I." and "W.D &" all refer to the same entity. It is worth checking whether you can collapse these levels manually.
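Collapsing the levels by hand could look roughly like this. It is only a sketch on a toy vector: "Other Org" is a made-up level, and treating the three spellings as one funder is an assumption that should be checked against the data.

```r
# Toy vector standing in for data$funder; "Other Org" is a made-up level.
funder <- c("W.D.&.I.", "W.D & I.", "W.D &", "Other Org")

# Spellings assumed (not confirmed) to refer to the same funder.
variants <- c("W.D.&.I.", "W.D & I.", "W.D &")

# Map every variant to one canonical spelling, leave other values untouched.
funder_collapsed <- ifelse(funder %in% variants, "W.D & I.", funder)

print(unique(funder_collapsed))
```

Doing this before the recipe means the dummy step only ever sees one level for that funder.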

What is generating the error 'Can't subset `.data` outside of a data mask context' with 'dplyr'?

I have a huge shiny app which uses a huge package. I'm not the author of any of them and I'm a bit lost. A function (fermentationPlot) throws the error: Can't subset .data outside of a data mask context:
Warning: Error in fermentationPlot: Can't subset `.data` outside of a data mask context.
185: <Anonymous>
173: dplyr::arrange
172: dplyr::mutate
171: as.data.frame
What could be the cause of this error? What does it mean? Below is the code block which generates it. I googled this error message and I found that it can be fixed by downgrading 'dplyr'. I tried 1.0.10, 1.0.5 and 1.0.0, and the error always occurs.
plotInfo <- dplyr::left_join(
x = dplyr::select(
plotDefaults, -c(.data$templateName, .data$minValue, .data$maxValue)
),
y = plotSettings,
by = .data$dataName
) %>%
dplyr::arrange(!is.na(.data$order), -.data$order) %>%
dplyr::mutate(
color = replace(.data$color, .data$color == "Blue", "Dark blue"),
minValue = as.numeric(.data$minValue),
maxValue = as.numeric(.data$maxValue)
) %>%
as.data.frame()
The by argument of left_join must be a character vector of column names. Probably the author wanted to do
by = "dataName"
and not
by = .data$dataName
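A small sketch of the corrected join, using toy data frames. The values are invented for illustration; only the names dataName, templateName, color and order come from the question.

```r
library(dplyr)

# Toy stand-ins for the package's plotDefaults and plotSettings.
plotDefaults <- tibble(
  dataName = c("a", "b"),
  templateName = c("t1", "t2"),
  color = c("Blue", "Red")
)
plotSettings <- tibble(dataName = c("a", "b"), order = c(2, 1))

# `by` is an ordinary argument, not a data-masked one, so it takes a string.
plotInfo <- left_join(
  x = select(plotDefaults, -templateName),
  y = plotSettings,
  by = "dataName"
)

print(plotInfo)
```

The .data pronoun only exists inside data-masking verbs such as mutate() or arrange(), which is why it cannot be evaluated in the by argument.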

Issue creating statcast database with BaseballR Package

I am trying to create a database of all MLB statcast outcomes. For this, I am using the baseballr package made by Bill Petti https://billpetti.github.io/2020-05-26-build-statcast-database-rstats-version-2.0/. I am not connecting to a SQL database but simply making a data frame in R. I want to collect all statcast data from 2019 and 2020. First, I loaded in the necessary packages.
library(baseballr)
library(tidyverse)
Then I executed the annual_statcast_query function:
annual_statcast_query <- function(season) {
  dates <- seq.Date(as.Date(paste0(season, '-03-01')),
                    as.Date(paste0(season, '-12-01')), by = 'week')
  date_grid <- tibble(start_date = dates,
                      end_date = dates + 6)
  safe_savant <- safely(scrape_statcast_savant)
  payload <- map(.x = seq_along(date_grid$start_date),
                 ~{message(paste0('\nScraping week of ', date_grid$start_date[.x], '...\n'))
                   payload <- safe_savant(start_date = date_grid$start_date[.x],
                                          end_date = date_grid$end_date[.x], type = 'pitcher')
                   return(payload)
                 })
  payload_df <- map(payload, 'result')
  number_rows <- map_df(.x = seq_along(payload_df),
                        ~{number_rows <- tibble(week = .x,
                                                number_rows = length(payload_df[[.x]]$game_date))}) %>%
    filter(number_rows > 0) %>%
    pull(week)
  payload_df_reduced <- payload_df[number_rows]
  combined <- payload_df_reduced %>%
    bind_rows()
  return(combined)
}
When I ran his code for the 2019 season payload <- annual_statcast_query(2019), I could scrape the data without any problems. However, when I tried it for 2020 payload <- annual_statcast_query(2020) I encountered the error:
Error: Can't combine `spin_rate_deprecated` <logical> and `spin_rate_deprecated` <character>.
This error occurs in the last part of the annual_statcast_query function:
combined <- payload_df_reduced %>%
bind_rows()
When reading through the statcast documentation (https://baseballsavant.mlb.com/csv-docs), it appears that the variable spin_rate_deprecated was replaced by release_spin. Perhaps this is why I am encountering this error. I do not need this variable for my analysis, and the error tracing I did made it very obvious that fixing the problem is beyond my skill set as a college student.
> rlang::last_error()
<error/vctrs_error_incompatible_type>
Can't combine `spin_rate_deprecated` <logical> and `spin_rate_deprecated` <character>.
Backtrace:
1. global::annual_statcast_query(2020)
3. dplyr::bind_rows(.)
4. vctrs::vec_rbind(!!!dots, .names_to = .id)
6. vctrs::vec_default_ptype2(...)
7. vctrs:::vec_ptype2_df_fallback(x, y, opts)
8. vctrs:::vec_ptype2_params(...)
9. vctrs:::vec_ptype2_opts(x, y, opts = opts, x_arg = x_arg, y_arg = y_arg)
11. vctrs::vec_default_ptype2(...)
12. vctrs::stop_incompatible_type(...)
13. vctrs:::stop_incompatible(...)
14. vctrs:::stop_vctrs(...)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/vctrs_error_incompatible_type>
Can't combine `spin_rate_deprecated` <logical> and `spin_rate_deprecated` <character>.
Backtrace:
x
1. +-global::annual_statcast_query(2020)
2. | \-payload_df_reduced %>% bind_rows()
3. \-dplyr::bind_rows(.)
4. \-vctrs::vec_rbind(!!!dots, .names_to = .id)
5. \-(function () ...
6. \-vctrs::vec_default_ptype2(...)
7. \-vctrs:::vec_ptype2_df_fallback(x, y, opts)
8. \-vctrs:::vec_ptype2_params(...)
9. \-vctrs:::vec_ptype2_opts(x, y, opts = opts, x_arg = x_arg, y_arg = y_arg)
10. \-(function () ...
11. \-vctrs::vec_default_ptype2(...)
12. \-vctrs::stop_incompatible_type(...)
13. \-vctrs:::stop_incompatible(...)
14. \-vctrs:::stop_vctrs(...)
Therefore, I tried to drop this variable from my database before the bind rows operation to avoid the error.
combined <- payload_df_reduced %>%
payload_df_reduced[ , !names(payload_df_reduced) %in% c("spin_rate_deprecated")] %>%
bind_rows()
However, this returned the error message:
Error in .[payload_df_reduced, , !names(payload_df_reduced) %in% c("spin_rate_deprecated")] :
incorrect number of dimensions
I am running
packageVersion("baseballr") [1] ‘0.8.3’
on R 4.0.3.
If anyone could help me find a way to do this, that would be amazing. I am not picky about how I get this data, so I am all ears if anyone has an idea. Thank you so much!
To drop a column from a single data frame, you can do this:
payload_df_reduced %>%
  select(-c(spin_rate_deprecated))
or, keeping your base R approach:
payload_df_reduced[ , !names(payload_df_reduced) %in% c("spin_rate_deprecated")]
Your current code does not work because the syntax is invalid: you pipe payload_df_reduced into an expression that already subsets payload_df_reduced, so the pipe passes it as an extra argument to `[` and R reports an incorrect number of dimensions.
However, it seems that your payload_df_reduced is a list of data frames, not a single data frame. I tried to run your code, but it relies on other functions, so it is not reproducible for me. Here is an untested sketch that you may need to adjust a bit:
combined <- map(payload_df_reduced, select, -c(spin_rate_deprecated)) %>%
bind_rows()
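The same idea on a toy list of data frames, as a hedged sketch. The two weekly tibbles and their values are invented; any_of() is used here (a substitution for the bare column name above) so that a week without the column does not raise an error.

```r
library(dplyr)
library(purrr)

# Two toy weekly results: the column is logical in one and character in the other.
week1 <- tibble(game_date = "2020-07-23", spin_rate_deprecated = FALSE)
week2 <- tibble(game_date = c("2020-07-30", "2020-07-31"),
                spin_rate_deprecated = c("a", "b"))
payload_df_reduced <- list(week1, week2)

# Drop the column from every element first, then bind the rows.
combined <- payload_df_reduced %>%
  map(~ select(.x, -any_of("spin_rate_deprecated"))) %>%
  bind_rows()

print(combined)
```

Because the conflicting column is gone before bind_rows() runs, there is nothing left for it to reconcile.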

Trouble making a new column using case_when in Dplyr

I'm trying to make a new column in my data frame which will contain only the receivers we set up in one area of study. I've checked other pages on here but I still get the same error:
<error/dplyr:::mutate_error>
Problem with mutate() input ER.
x Case 1 (Receiver %in% c("1326", "1315", "1314", "1321", "1404", "1318", "1325", "1313...) must be a two-sided formula, not a logical vector.
i Input ER is case_when(...).
Backtrace:
Run rlang::last_trace() to see the full context.
I've tried:
ev<-ev %>%
select(Receiver) %>%
mutate(ER=case_when(c(Receiver=="1315"|
Receiver=="1314"|
Receiver=="1321"|
Receiver=="1404"|
Receiver=="1318"|
Receiver=="1325"|
Receiver=="1313"|
Receiver=="1323"|
Receiver=="1324"|
Receiver=="1320"|
Receiver=="1319"|
Receiver=="1317")))
And:
ev<-ev %>%
mutate(ER=case_when(Receiver %in% c("1326", "1315", "1314", "1321", "1404", "1318", "1325", "1313", "1323", "1324", "1320", "1319", "1317")))
Any help showing me where I've gone wrong is much appreciated.
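case_when() expects two-sided formulas of the form condition ~ value, and your calls supply only the condition. Let's assume you have two studies, study_A and study_B. Then you need to add replacement values, e.g. for only two of your given strings: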
ev <- ev %>%
mutate(ER = case_when(Receiver == "1326" ~ "study_A",
Receiver == "1315" ~ "study_B"))
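If the goal is to flag every receiver from one study area at once, the %in% test from the question can go on the left-hand side of the formula. This is a sketch on toy data: the label "study_A", the shortened receiver list and the receiver "9999" are placeholders.

```r
library(dplyr)

# Toy data; "9999" stands for a receiver outside the study area.
ev <- tibble(Receiver = c("1326", "1315", "9999"))

# Shortened, illustrative list of receivers in the study area.
study_receivers <- c("1326", "1315", "1314")

ev <- ev %>%
  mutate(ER = case_when(
    Receiver %in% study_receivers ~ "study_A",  # condition ~ value
    TRUE ~ NA_character_                        # everything else stays NA
  ))

print(ev)
```

Each case needs both sides of the ~, and all right-hand sides should have the same type, here character.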

Filter data with !is.na() . Alternative to !is na?

I'm trying to replicate some code. This is what I have so far:
statefips <- read.csv("https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_fips_master.csv") %>%
select(state_name, state_abbr, region_name, division_name) %>% dplyr::rename(state = state_name)
That code works fine.
But there is an issue here:
uspop <- read.csv("https://raw.githubusercontent.com/JoseMontoya518/uspop2018/master/PEP_2018_PEPSR6H_with_ann.csv") %>%
janitor::clean_names()%>%
dplyr::filter(year_id == "est72018", !is.na(statefips))
I get this error message:
Error: Problem with `filter()` input `..2`.
x Input `..2` must be of size 5149 or 1, not size 50.
i Input `..2` is `!is.na(statefips)`.
So I tried another approach. Instead of passing
!is.na(statefips)
as an input to filter(), I used this:
%>% na.exclude
So this is the final code:
uspop <- read.csv("https://raw.githubusercontent.com/JoseMontoya518/uspop2018/master/PEP_2018_PEPSR6H_with_ann.csv") %>%
janitor::clean_names()%>%
dplyr::filter(year_id == "est72018")%>% na.exclude
That code works, but I don't know whether it achieves the purpose of the original code.
!is.na(statefips)
was added as an input for a reason in the code that I'm replicating. When I remove %>% na.exclude, nothing changes from the original data frame:
uspop <- read.csv("https://raw.githubusercontent.com/JoseMontoya518/uspop2018/master/PEP_2018_PEPSR6H_with_ann.csv") %>%
janitor::clean_names()%>%
dplyr::filter(year_id == "est72018")
So, is there a way to filter the data with this input: !is.na(statefips)?
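One plausible reading of the error, offered as an assumption rather than a confirmed diagnosis: the size of 50 in the message suggests that statefips is not a column of the population data, so filter() falls back to the statefips data frame created earlier. The sketch below reproduces that mechanism with made-up toy data (geo_id is a hypothetical column name) and shows that filtering on a column that actually exists works.

```r
library(dplyr)

# A data frame called statefips in the calling environment (3 rows).
statefips <- data.frame(state = c("A", "B", "C"))

# Toy population data with 5 rows and no column called statefips.
uspop <- data.frame(
  year_id = c("x", "x", "y", "y", "x"),
  geo_id = c(1, NA, 3, 4, 5)
)

# statefips is not a column, so filter() picks up the data frame above and fails.
res <- tryCatch(
  filter(uspop, !is.na(statefips)),
  error = function(e) "filter failed"
)
print(res)

# Filtering on a column that exists in uspop works as expected.
ok <- filter(uspop, year_id == "x", !is.na(geo_id))
print(ok)
```

Checking names(uspop) after janitor::clean_names() would show which column, if any, holds the state FIPS code.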

How to extract columns from a row and save the output as a variable dplyr

I am trying to extract a specific column from a specific row of my Excel sheet (df). However, when I try to do so I get the message:
Error: ... must evaluate to column positions or names, not a list
Call `rlang::last_error()` to see a backtrace.
When I call rlang::last_error() I get:
Backtrace:
1. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin)
9. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
10. tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n not { first_type }")
11. tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
12. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin)
At this point, I am lost. What can I do to my code to work?
library(readxl)
Lakers_Overall_Stats <- read_excel("Desktop/Lakers Overall Stats.xlsx")
library(readxl)
Lakers_Record <- read_excel("Desktop/Lakers Record.xlsx")
require(dplyr)
require(ggplot2)
##WinPercentage of the Team after season
mydata <- Lakers_Record %>% select(Pts,Opp,W,L)%>%
+ mutate(wpct=Pts^13.91/(Pts^13.91+Opp^13.91),expwin=round(wpct*(W+L)),diff=W-expwin)
head(mydata)
##Specifiying
Lakers_Overall_Stats[23,6] <- TmMin
Lakers_Overall_Stats[23,8] <- TmFga
Lakers_Overall_Stats[23,18] <- TmFta
Lakers_Overall_Stats[23,26] <- TmTov
rlang::last_error()
##Usage Percentage
Usgpct <- Lakers_Overall_Stats %>% select(FGA,FTA,TOV,MP,TmFga,TmFta,TmTov,TmMin)%>%
+ mutate(100*(Fga+0.44*Fta+Tov))*TmMin/(TmFga+0.44*TmFta+TmTov)*5(MP)
##head(Usgpct)
##filter(rank(desc(Usgpct))==1)
Also, am I filtering correctly, or should it be written as
Usgpct <- Lakers_Overall_Stats %>% select(FGA,FTA,TOV,MP,TmFga,TmFta,TmTov,TmMin)%>%
filter(rank(desc(Usgpct))==1)%>%
mutate(100*(Fga+0.44*Fta+Tov))*TmMin/(TmFga+0.44*TmFta+TmTov)*5(MP)
head(Usgpct)
You have
Lakers_Overall_Stats[23,6] <- TmMin
This will modify the Lakers_Overall_Stats data frame by setting the element at 23,6 etc. to be TmMin. TmMin is an object outside of your data frame.
Maybe you want:
TmMin <- Lakers_Overall_Stats[23,6]
?
Also, you cannot select TmFga,TmFta,TmTov,TmMin since these variables are not part of your data frame. You can refer to those variables in your mutate equation, but because of the way you've set it up, they're stand-alone variables.
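Putting both points together, a sketch on toy data might look like this. The numbers, the Player column and the position of the team row are invented, and the usage formula is the commonly cited one, assumed here rather than taken from your code.

```r
library(dplyr)

# Toy box score; the last row holds the team totals (made-up numbers).
stats <- data.frame(
  Player = c("A", "B", "Team"),
  MP  = c(30, 20, 240),
  FGA = c(15, 10, 85),
  FTA = c(5, 2, 25),
  TOV = c(3, 1, 14)
)
team_row <- 3

# [[row, column]] returns a plain number instead of a one-cell data frame.
TmMin <- stats[[team_row, "MP"]]
TmFga <- stats[[team_row, "FGA"]]
TmFta <- stats[[team_row, "FTA"]]
TmTov <- stats[[team_row, "TOV"]]

# The stand-alone team values can be used inside mutate() without selecting them.
usg <- stats %>%
  filter(Player != "Team") %>%
  select(Player, FGA, FTA, TOV, MP) %>%
  mutate(UsgPct = 100 * (FGA + 0.44 * FTA + TOV) * (TmMin / 5) /
           (MP * (TmFga + 0.44 * TmFta + TmTov)))

print(usg)
```

Only columns that exist in the data frame go into select(); the team totals are looked up from the surrounding environment when mutate() evaluates the formula.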
