Trouble making a new column using case_when in dplyr - r

I'm trying to make a new column in my data frame which will contain only the receivers we set up in one area of study. I've checked other pages on here but I still get the same error:
<error/dplyr:::mutate_error>
Problem with mutate() input ER.
x Case 1 (Receiver %in% c("1326", "1315", "1314", "1321", "1404", "1318", "1325", "1313...) must be a two-sided formula, not a logical vector.
i Input ER is case_when(...).
Backtrace:
Run rlang::last_trace() to see the full context.
I've tried:
ev <- ev %>%
  select(Receiver) %>%
  mutate(ER = case_when(c(Receiver == "1315" |
                            Receiver == "1314" |
                            Receiver == "1321" |
                            Receiver == "1404" |
                            Receiver == "1318" |
                            Receiver == "1325" |
                            Receiver == "1313" |
                            Receiver == "1323" |
                            Receiver == "1324" |
                            Receiver == "1320" |
                            Receiver == "1319" |
                            Receiver == "1317")))
And:
ev <- ev %>%
  mutate(ER = case_when(Receiver %in% c("1326", "1315", "1314", "1321", "1404", "1318", "1325", "1313", "1323", "1324", "1320", "1319", "1317")))
Any help showing me where I've gone wrong is much appreciated.

Let's assume you have two studies, study_A and study_B. case_when() expects two-sided formulas (condition ~ value), so you need to supply a replacement value for each condition, e.g. for only two of your given strings:
ev <- ev %>%
  mutate(ER = case_when(Receiver == "1326" ~ "study_A",
                        Receiver == "1315" ~ "study_B"))

Related

Error: 'list' object cannot be coerced to type 'double' in R

I'm new to R. I'm trying to get the SD of weight in lbs. First I'm getting the weight in lbs from a dataset with weight in kg. When I call typeof() on the result, it's a list, but in the console it prints as a 'list' of 'dbl'. I've tried as.numeric() and as.integer() in the pipe but both give the same error. How can I get the SD?
I have other questions that have similar issues (data type being a list when they should be numeric) so if you can explain why that's happening that would be great!
weight_lbs <- brfss %>%
  clean_names(., "lower_camel") %>%
  select(havarth3, wtkg3) %>%
  filter(havarth3 == "1") %>%
  na.omit() %>%
  mutate(weight_lbs = (round(wtkg3 * 2.20462) / 100), 2) %>%
  select(weight_lbs) %>%
  as.numeric()
weight_lbs
sd_weight <- sd(weight_lbs, na.rm = TRUE)
Try this code. as.numeric() alone won't work here; wrap it in mutate():
weight_lbs <- brfss %>%
  clean_names(., "lower_camel") %>%
  select(havarth3, wtkg3) %>%
  filter(havarth3 == "1") %>%
  na.omit() %>%
  mutate(weight_lbs = round(wtkg3 * 2.20462 / 100, 2)) %>%
  select(weight_lbs) %>%
  mutate(weight_lbs = as.numeric(weight_lbs)) %>%
  mutate(sd_weight_lbs = sd(weight_lbs))
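For what it's worth, a sketch of an alternative (assuming the same brfss columns and that the intended rounding was round(..., 2)): pull() the column out as a plain numeric vector and take sd() outside the pipe, which avoids calling as.numeric() on a one-column tibble in the first place:
library(dplyr)
library(janitor)

weight_lbs <- brfss %>%
  clean_names("lower_camel") %>%
  filter(havarth3 == "1") %>%
  na.omit() %>%
  mutate(weight_lbs = round(wtkg3 * 2.20462 / 100, 2)) %>%  # kg (scaled) to lbs, as in the original code
  pull(weight_lbs)                                          # a numeric vector, not a one-column tibble

sd_weight <- sd(weight_lbs, na.rm = TRUE)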

Column name being duplicated in recipe

This is the piece of code I'm having trouble with:
pump_recipe <- recipe(status_group ~ ., data = data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_knn(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
prepared_rec <- prep(pump_recipe)
The error:
Error:
! Column name `funder_W.D...I.` must not be duplicated.
Use .name_repair to specify repair.
Caused by error in `stop_vctrs()`:
! Names must be unique.
x These names are duplicated:
* "funder_W.D...I." at locations 1807 and 1808.
Backtrace:
1. recipes::prep(pump_recipe)
2. recipes:::prep.recipe(pump_recipe)
4. recipes:::bake.step_dummy(x$steps[[i]], new_data = training)
8. tibble:::as_tibble.data.frame(indicators)
9. tibble:::lst_to_tibble(unclass(x), .rows, .name_repair)
...
16. vctrs `<fn>`()
17. vctrs:::validate_unique(names = names, arg = arg)
18. vctrs:::stop_names_must_be_unique(names, arg)
19. vctrs:::stop_names(...)
20. vctrs:::stop_vctrs(class = c(class, "vctrs_error_names"), ...)
So basically it seems like the step_dummy() step is doing something strange and creating a duplicated column here. I don't know why this is happening. This is the data I'm working with:
https://github.com/norhther/datasets/blob/main/data.csv
You have levels in funder and installer that are so similar that step_dummy() creates labels with the same name. The error says that funder_W.D...I. appears twice.
If we do some filtering on the funder column we see that there are 3 different names that match.
str_subset(data$funder, "W.D") |> unique()
[1] "W.D.&.I." "W.D & I." "W.D &"
Neither "W.D.&.I." or "W.D & I." are valid names so step_dummy() tries to fix them. This yields "funder_W.D...I." for both.
You can fix this by using textrecipes::step_clean_levels(), this make sure that the levels of these variables stay valid and non-overlapping.
library(recipes)
pump_recipe <- recipe(status_group ~ ., data = data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_knn(all_nominal_predictors()) %>%
  textrecipes::step_clean_levels(funder, installer) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
prepared_rec <- prep(pump_recipe)
Note: As you say, I would imagine that "W.D.&.I.", "W.D & I." and "W.D &" all refer to the same entity. You should take a look to see if you can collapse these levels manually.
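If you do want to collapse them by hand, a rough sketch (the canonical spelling chosen here is an assumption; use whatever label fits your data) could recode the near-duplicates before building the recipe:
library(dplyr)

data <- data %>%
  mutate(funder = recode(funder,
                         "W.D.&.I." = "W.D. and I.",  # map all three spellings
                         "W.D & I." = "W.D. and I.",  # to one canonical label
                         "W.D &"    = "W.D. and I."))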

Create binary yes/no animal variable based on match with any term in a dictionary, "animal" in R

Continuing from this question: R: Create category column reflecting match between a dictionary and column in df
I have a big dataset, "df", of 30,000 rows, and two big dictionary dataframes: (1) animal, 600k rows; (2) nature, 300k rows.
I am simply trying to figure out how to create two simple binary variables, "df$content_animal" and "df$content_nature", based on whether each row in df$content has any matches with the "animal" or "nature" dictionaries (1 = match, 0 = no match).
Below are data samples; it's impossible for me to include the entire datasets here:
df <- tibble(content= c("hello turkey feet blah blah blah", "i love rabbits haha", "wow this sunlight is amazing", "omg did u see the rainbow?!", "turtles like swimming in the water", "i love running across grassy lawns with my dog"))
animal=c("turkey", "rabbit", "turtle", "dog", "cat", "bear")
nature=c("sunlight", "water", "rainbow", "grass", "lawn", "mountain", "ice")
I have tried the following code, based on multiple-pattern matching, without success. I suspect it is because of the size of both my dataset and the dictionaries/patterns:
df$content_animal <- grepl(paste(animal,collapse="|"),df$content,ignore.case=TRUE)
df$content_nature <- grepl(paste(nature,collapse="|"),df$content,ignore.case=TRUE)
which returns the error:
Error in grepl(paste(animal,collapse="|"), df$content, :
invalid regular expression, reason 'Out of memory' Error in grepl(paste(nature,collapse="|"), df$content, :
invalid regular expression, reason 'Out of memory'
I also tried:
df <- df %>%
  mutate(
    content_animal = case_when(grepl(animal, content) ~ "1")
  )
df <- df %>%
  mutate(
    content_nature = case_when(grepl(nature, content) ~ "1")
  )
which returns the error:
Problem with `mutate()` input `content_animal`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
ℹ Input `content_animal` is `case_when(grepl(animal, content) ~ "1")`.argument 'pattern' has length > 1 and only the first element will be used
Problem with `mutate()` input `content_nature`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
ℹ Input `content_nature` is `case_when(grepl(nature, content) ~ "1")`.argument 'pattern' has length > 1 and only the first element will be used
I also tried:
bench::mark(
  basic = mutate(df, content_animal = 1L * map_lgl(content, ~ any(str_detect(.x, animal))),
                     content_nature = 1L * map_lgl(content, ~ any(str_detect(.x, nature)))),
  fixed = mutate(df, content_animal = 1L * map_lgl(content, ~ any(str_detect(.x, fixed(animal)))),
                     content_nature = 1L * map_lgl(content, ~ any(str_detect(.x, fixed(nature))))))
which ran for over two hours without giving me any output.
I'm really at a loss as to what I should do. Does anyone have any ideas? Is there a better package or approach to use for my big-data purposes?
It may be better to loop over the dictionary terms with lapply and combine the results with Reduce:
Reduce(`|`, lapply(nature, function(x) grepl(x, df$content, ignore.case = TRUE)))
#[1] FALSE FALSE TRUE TRUE TRUE TRUE
which is the same as
grepl(paste(nature,collapse="|"),df$content,ignore.case=TRUE)
#[1] FALSE FALSE TRUE TRUE TRUE TRUE
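To turn that into the two 0/1 columns, a small sketch along the same lines (assuming df, animal, and nature as defined above):
match_any <- function(terms, text) {
  # one grepl() pass per dictionary term, OR-ed together, instead of one huge regex
  Reduce(`|`, lapply(terms, function(x) grepl(x, text, ignore.case = TRUE)))
}

df$content_animal <- as.integer(match_any(animal, df$content))
df$content_nature <- as.integer(match_any(nature, df$content))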
Here's an approach with the quanteda package, which has built-in functions for doing exactly what you want. (I tried this only on the sample dataset; I'd be interested to hear what its performance is on the whole thing.)
library(quanteda)
c = corpus(df$content)
d = dictionary(list(animal = animal, nature = nature))
df = cbind(df, convert(dfm(c, dictionary = d), to = "data.frame")[,-1])
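Note that the dictionary features in the dfm are match counts per document; if strictly 0/1 flags are needed, they can be thresholded afterwards, e.g. (assuming the converted columns are named animal and nature after the cbind()):
df$content_animal <- as.integer(df$animal > 0)
df$content_nature <- as.integer(df$nature > 0)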

Filter data with !is.na(). Alternative to !is.na()?

I'm trying to replicate some code. This is my code so far:
statefips <- read.csv("https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_fips_master.csv") %>%
  select(state_name, state_abbr, region_name, division_name) %>%
  dplyr::rename(state = state_name)
That code works fine.
But there is an issue here:
uspop <- read.csv("https://raw.githubusercontent.com/JoseMontoya518/uspop2018/master/PEP_2018_PEPSR6H_with_ann.csv") %>%
  janitor::clean_names() %>%
  dplyr::filter(year_id == "est72018", !is.na(statefips))
I get this error message:
Error: Problem with `filter()` input `..2`.
x Input `..2` must be of size 5149 or 1, not size 50.
i Input `..2` is `!is.na(statefips)`.
So I tried another way: instead of passing !is.na(statefips) as an input to filter(), I used %>% na.exclude. This is the final code:
uspop <- read.csv("https://raw.githubusercontent.com/JoseMontoya518/uspop2018/master/PEP_2018_PEPSR6H_with_ann.csv") %>%
  janitor::clean_names() %>%
  dplyr::filter(year_id == "est72018") %>%
  na.exclude
That code works, but I don't know if the purpose of the code is achieved. !is.na(statefips) was added as an input for a reason in the code I'm replicating, and when I remove %>% na.exclude, nothing changes from the original data frame:
uspop <- read.csv("https://raw.githubusercontent.com/JoseMontoya518/uspop2018/master/PEP_2018_PEPSR6H_with_ann.csv") %>%
  janitor::clean_names() %>%
  dplyr::filter(year_id == "est72018")
So, is there a way to filter the data with !is.na(statefips)?
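For reference, !is.na() inside filter() has to refer to a column of the data frame being piped in; in the snippet above, statefips is the separate 50-row lookup table created earlier, which is why filter() complains about size 50. A minimal sketch of the working pattern (some_column is a placeholder for whichever column of uspop the code being replicated actually meant):
library(dplyr)

uspop <- read.csv("https://raw.githubusercontent.com/JoseMontoya518/uspop2018/master/PEP_2018_PEPSR6H_with_ann.csv") %>%
  janitor::clean_names() %>%
  filter(year_id == "est72018", !is.na(some_column))  # some_column: hypothetical column of uspop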

How to extract columns from a row and save the output as a variable dplyr

I am trying to extract a specific column from a specific row of my Excel sheet (df). However, when I try to do so I get the message:
Error: ... must evaluate to column positions or names, not a list
Call `rlang::last_error()` to see a backtrace.
When I call rlang::last_error() I get:
Backtrace:
1. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin)
9. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
10. tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n not { first_type }")
11. tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
12. dplyr::select(., FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin)
At this point, I am lost. What can I do to make my code work?
library(readxl)
Lakers_Overall_Stats <- read_excel("Desktop/Lakers Overall Stats.xlsx")
library(readxl)
Lakers_Record <- read_excel("Desktop/Lakers Record.xlsx")
require(dplyr)
require(ggplot2)

## Win percentage of the team after the season
mydata <- Lakers_Record %>%
  select(Pts, Opp, W, L) %>%
  mutate(wpct = Pts^13.91 / (Pts^13.91 + Opp^13.91),
         expwin = round(wpct * (W + L)),
         diff = W - expwin)
head(mydata)

## Specifying
Lakers_Overall_Stats[23, 6] <- TmMin
Lakers_Overall_Stats[23, 8] <- TmFga
Lakers_Overall_Stats[23, 18] <- TmFta
Lakers_Overall_Stats[23, 26] <- TmTov
rlang::last_error()

## Usage percentage
Usgpct <- Lakers_Overall_Stats %>%
  select(FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin) %>%
  mutate(100 * (Fga + 0.44 * Fta + Tov)) * TmMin / (TmFga + 0.44 * TmFta + TmTov) * 5(MP)
## head(Usgpct)
## filter(rank(desc(Usgpct)) == 1)
Also, am I filtering correctly, or should it be written as:
Usgpct <- Lakers_Overall_Stats %>%
  select(FGA, FTA, TOV, MP, TmFga, TmFta, TmTov, TmMin) %>%
  filter(rank(desc(Usgpct)) == 1) %>%
  mutate(100 * (Fga + 0.44 * Fta + Tov)) * TmMin / (TmFga + 0.44 * TmFta + TmTov) * 5(MP)
head(Usgpct)
You have
Lakers_Overall_Stats[23,6] <- TmMin
This will modify the Lakers_Overall_Stats data frame by setting the element at [23, 6] (and likewise for the other lines) to TmMin; but TmMin is an object outside of your data frame.
Maybe you want:
TmMin <- Lakers_Overall_Stats[23,6]
?
Also, you cannot select(TmFga, TmFta, TmTov, TmMin), since these variables are not columns of your data frame. You can refer to those variables in your mutate() equation, but, the way you've set things up, they're stand-alone variables.
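Putting that together, a hedged sketch of what the question seems to be aiming for (the row/column positions are taken from the question; the usage-percentage formula is the standard basketball one and the usg_pct name is made up here, so adjust both as needed):
library(dplyr)

# Pull the team totals out of row 23 as stand-alone numbers (positions as given in the question).
TmMin <- Lakers_Overall_Stats[[23, 6]]
TmFga <- Lakers_Overall_Stats[[23, 8]]
TmFta <- Lakers_Overall_Stats[[23, 18]]
TmTov <- Lakers_Overall_Stats[[23, 26]]

Usgpct <- Lakers_Overall_Stats %>%
  select(FGA, FTA, TOV, MP) %>%
  mutate(usg_pct = 100 * (FGA + 0.44 * FTA + TOV) * (TmMin / 5) /
                   (MP * (TmFga + 0.44 * TmFta + TmTov))) %>%
  filter(rank(desc(usg_pct)) == 1)  # keep the row with the highest usage

head(Usgpct)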
