Dummy variable in R for partial string

I want to create a dummy variable that is 1 if a column's value contains any of a set of number codes. For some reason str_detect is not working. The error is as follows:
Error in type(pattern) : argument "pattern" is missing, with no default
sam_data_rd$high_int <- as.integer(str_detect(sam_data_rd$assertions.primarynaics,
c("2111", "3254", "3341", "3342", "3344","3345", "3364", "5112", "5171", "51331",
"5179", "5133Z", "5182", "5191", "5142", "5141Z", "5191Z","5191ZM", "5413", "5415", "5417")))

Try this:
library(dplyr)
library(stringr)
pattern <- paste(c("2111", "3254", "3341", "3342", "3344","3345", "3364", "5112", "5171", "51331",
"5179", "5133Z", "5182", "5191", "5142", "5141Z", "5191Z","5191ZM", "5413", "5415", "5417"), collapse = "|")
sam_data_rd %>%
mutate(high_int = ifelse(str_detect(assertions.primarynaics, pattern), 1, 0))

The pattern can be a single string with the alternatives joined by OR (|). str_detect is also vectorized over pattern, but the comparison is then element-wise: the pattern vector must have length 1 or the same length as the string (i.e. the column). Passing a vector of 21 codes directly therefore does not test every code against every row.
library(stringr)
v1 <- c("2111", "3254", "3341", "3342", "3344","3345", "3364", "5112", "5171", "51331", "5179", "5133Z", "5182", "5191", "5142", "5141Z", "5191Z","5191ZM", "5413", "5415", "5417")
pat <- str_c("\\b(", str_c(v1, collapse = "|"), ")\\b")
sam_data_rd$high_int <-
as.integer(str_detect(sam_data_rd$assertions.primarynaics, pat))
Another option is to loop over the elements of v1 and then reduce the results to a single logical vector:
library(purrr)
library(dplyr)
sam_data_rd <- sam_data_rd %>%
mutate(high_int = map(v1,
~ str_detect(assertions.primarynaics, .x)) %>%
reduce(`|`) %>%
as.integer)
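To see what collapsing the codes into one pattern does, here is a minimal sketch on made-up data. It uses base R's grepl as a stand-in for str_detect, so it runs without extra packages. The sample rows and the shortened list of codes are assumptions for illustration only.

```r
# Hypothetical sample data; the column name mirrors the question.
sam_data_rd <- data.frame(
  assertions.primarynaics = c("2111", "9999", "5191ZM", "12345"),
  stringsAsFactors = FALSE
)

# A shortened, illustrative subset of the codes from the question.
codes <- c("2111", "3254", "5191ZM")

# One regular expression: "2111|3254|5191ZM".
pattern <- paste(codes, collapse = "|")

# grepl() returns TRUE/FALSE per row; as.integer() turns that into 1/0.
sam_data_rd$high_int <- as.integer(
  grepl(pattern, sam_data_rd$assertions.primarynaics)
)

sam_data_rd$high_int
```

Because the pattern has length 1, it is checked against every row, which is the behaviour the dummy variable needs.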

Related

Is there a R function that allows you to insert a variable inside a command?

For example:
Imagine I have an object named "cors" which contains a string only ("Spain" for example). I would like then for "cors" to be replaced in the expression (1) below by "Spain", resulting in the expression (2):
#(1)
DF <- DF %>% filter(str_detect(Country, "Germany|cors", negate = TRUE))
#(2)
DF <- DF %>% filter(str_detect(Country, "Germany|Spain", negate = TRUE))
P.S: I know that in MATLAB this could be handled with the "eval()" command, though in R it apparently has a completely different applicability.
If the value is stored in an object, keep the object outside the quotes and build the pattern string with paste or str_c:
library(dplyr)
library(stringr)
cors <- "Spain"
pat <- str_c(c("Germany", cors), collapse = "|")
DF %>%
filter(str_detect(Country, pat, negate = TRUE))
Another option is string interpolation with glue (assuming the cors object has only a single string element):
DF %>%
filter(str_detect(Country, glue::glue("Germany|{cors}"), negate = TRUE))
Or this can be done in base R with grepl
pat <- paste(c("Germany", cors), collapse = "|")
subset(DF, !grepl(pat, Country))
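Base R's sprintf is one more way to interpolate the value. The sketch below uses a made-up DF (the four countries are an assumption, not the asker's data) to show the pattern that gets built and the rows that survive the filter.

```r
# Hypothetical data frame standing in for DF.
DF <- data.frame(
  Country = c("Germany", "Spain", "France", "Italy"),
  stringsAsFactors = FALSE
)
cors <- "Spain"

# sprintf() substitutes the value of cors for %s, giving "Germany|Spain".
pat <- sprintf("Germany|%s", cors)

# Keep only the rows whose Country does not match the pattern.
result <- DF[!grepl(pat, DF$Country), , drop = FALSE]
result$Country
```

As with glue, this assumes cors holds a single string.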
If you really want eval, you could do:
cors <- 'Spain'
DF <- DF %>% filter(
eval(
parse(text=paste0('str_detect(Country, "Germany|', cors, '", negate=TRUE)'))
))

Cannot remove "," in data.frame (error: "unexpected numeric constant")

I have a data frame df.
I'd like to remove the commas from the values in columns "2019" through "2015".
So I used the following call.
(df <- as.numeric(gsub(",", "", df$2019)))
but R returned:
"Error : that is unexpected numeric constant.
df <- as.numeric(gsub(",", "", df$2019)
^ "
How can I solve the problem??
You can use lapply to loop over all the columns and remove commas and turn them to numeric.
df[] <- lapply(df, function(x) as.numeric(gsub(',', '', x)))
With tidyverse, we can do
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(everything(), ~ as.numeric(str_remove_all(., ","))))
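The error itself comes from df$2019: a name that starts with a digit is not syntactic in R, so after $ it has to be wrapped in backticks (or accessed with [[ ]] and a string). Note also that assigning the result to df would replace the whole data frame with a single vector. A minimal sketch, assuming a made-up two-column data frame:

```r
# Hypothetical data; check.names = FALSE keeps the numeric column names as-is.
df <- data.frame(
  `2019` = c("1,234", "56,789"),
  `2018` = c("2,000", "3,500"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Backticks let `$` accept a non-syntactic name such as 2019.
# Assign back to the column, not to df itself.
df$`2019` <- as.numeric(gsub(",", "", df$`2019`))

# [[ ]] with a string works too and needs no backticks.
df[["2018"]] <- as.numeric(gsub(",", "", df[["2018"]]))

df
```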

Checking a list of words in a column

I have the following tidyverse code and a list of words in words.xlsx, such as:
hello
world
program
data
analysis
v1 = read_excel('words.xlsx') %>%
mutate(words = tolower(words))%>%
pull(1)
for(v in v1){
data1 = data1 %>%
mutate(!! v := as.integer(heading %like% v))
}
I want to edit this code so that, instead of an integer, I get the actual words that were found in each string (separated by commas).
You can paste all the words in v1 together with word boundaries and use str_extract_all to extract any word from v1 that is present in data1$heading. str_extract_all returns a list of matches, so we can use sapply to collapse each element into one concatenated string.
sapply(stringr::str_extract_all(data1$heading,
paste0('\\b', v1, '\\b', collapse = '|')), function(x) toString(unique(x)))
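For a version that can be run without the spreadsheet, here is a rough base R analogue using gregexpr and regmatches in place of str_extract_all. The word list and headings are made up for illustration.

```r
# Hypothetical word list and headings.
v1 <- c("hello", "world", "data")
heading <- c("hello big world", "no match here", "data, more data")

# One pattern with word boundaries around each word.
pat <- paste0("\\b", v1, "\\b", collapse = "|")

# gregexpr() locates every match; regmatches() extracts them as a list.
found <- regmatches(heading, gregexpr(pat, heading, perl = TRUE))

# Collapse the unique matches per heading into one comma-separated string.
words_found <- vapply(found, function(x) toString(unique(x)), character(1))
words_found
```

Headings without any match come back as an empty string, and unique() keeps a repeated word from being listed twice.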

Mapping str_detect over a list of strings to detect a second list of strings

Take a list of strings:
strings <- c("ABC_XZY", "qwe_xyz", "XYZ")
I'd like to get all elements in strings that don't contain a specific substring
avoid <- c("ABC")
I can do this
library(stringr)
library(dplyr)
library(purrr)
strings %>%
.[!map_lgl(., str_detect, avoid)]
[1] "qwe_xyz" "XYZ"
What I'd like to do though is specify several substrings
avoid_2 <- c("ABC", "qwe")
And then map over the list as before (doesn't work)
strings %>%
.[!map_lgl(., str_detect, avoid_2)]
Error: Result 1 must be a single logical, not a logical vector of length 2
What I want is
[1] "XYZ"
The error is clear: each element of strings generates one logical per element of avoid_2, for a total of two logicals per element, and map_lgl can only handle one per element.
I can of course do each substring separately, but I don't want to - I want to make a list of substrings
# don't want, but it does work
strings %>%
.[!map_lgl(., str_detect, "ABC")] %>%
.[!map_lgl(., str_detect, "qwe")]
One option could be:
strings[map_lgl(strings, ~ !any(str_detect(., avoid_2)))]
[1] "XYZ"
Or doing directly:
strings[!str_detect(strings, paste(avoid_2, collapse = "|"))]
In addition to the answers already provided, note that stringr::str_detect and, therefore, stringr::str_subset are vectorised over both their string and pattern arguments. That vectorisation is element-wise, though: each string is paired with one pattern (recycling the shorter vector), rather than every string being tested against every pattern. The call below returns the desired result only because the order of avoid_2 happens to line up with strings:
library(stringr)
strings <- c("ABC_XZY", "qwe_xyz", "XYZ")
avoid_2 <- c("ABC", "qwe")
str_subset(strings, avoid_2, negate = TRUE)
#> Warning in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, :
#> longer object length is not a multiple of shorter object length
#> [1] "XYZ"
The warning (which stems from the underlying stringi::stri_subset_regex) signals that recycling: "ABC_XZY" is compared with "ABC", "qwe_xyz" with "qwe", and "XYZ" with "ABC" again. With avoid_2 <- c("qwe", "ABC") the result would differ, and recent stringr releases may reject mismatched lengths outright, so collapsing the patterns with | or iterating over them is the safer approach.
We can loop over the avoid_2 pattern vector instead of strings, because the string argument is vectorized (if pattern had the same length as string, both could be passed for element-wise checking). Then reduce the resulting logical vectors with |, negate, and extract the matching elements from strings:
library(dplyr)
library(stringr)
library(purrr)
avoid_2 %>%
map(~ str_detect(strings, .x)) %>%
reduce(`|`) %>% `!` %>%
magrittr::extract(strings, .)
#[1] "XYZ"
Or use base R's grep, where invert = TRUE returns the elements that do not match the pattern and value = TRUE returns the strings rather than their positions:
grep(paste(avoid_2, collapse="|"), strings, invert = TRUE, value = TRUE)
#[1] "XYZ"
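If the substrings should be matched literally rather than as regular expressions, a base R sketch along the following lines avoids both paste and regex escaping. It checks every string against every substring, so the order of avoid_2 does not matter; the order is reversed here on purpose to show that.

```r
strings <- c("ABC_XZY", "qwe_xyz", "XYZ")
avoid_2 <- c("qwe", "ABC")  # order deliberately differs from strings

# For each string, test it against every substring literally (fixed = TRUE,
# so regex metacharacters in avoid_2 would not need escaping).
has_any <- vapply(
  strings,
  function(s) any(vapply(avoid_2, grepl, logical(1), x = s, fixed = TRUE)),
  logical(1)
)

# Keep the strings that contain none of the substrings.
strings[!has_any]
```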
You can paste all the avoid_2 strings together and collapse them with "|". This creates a regular expression that you can feed into discard and str_detect.
library(tidyverse)
strings <- c("ABC_XZY", "qwe_xyz", "XYZ")
avoid_2 <- c("ABC", "qwe")
avoid_2 <- avoid_2 %>%
paste(., collapse = "|")
avoid_2
[1] "ABC|qwe"
# discard any values in strings that match the avoid_2 pattern
strings %>%
discard(str_detect(., avoid_2))
[1] "XYZ"
Here's a related method I found on RStudio Community, which uses keep with str_detect on the element names of a list:
library(tidyverse)
testlist <- list(
list(aaa_x = 1, aaa_y = 2, aaa_z = 5, bbb_a = 333, bbb_b = 222),
list(aaa_x = 7, aaa_y = 5, aaa_z = 6, bbb_a = 3939, bbb_b = 5635)
)
result_1 <- map(testlist, function(x) keep(x, .p = str_detect(names(x), "aaa")))
result_2 <- map(testlist, ~ keep(.x, .p = str_detect(names(.x), "aaa")))
identical(result_1, result_2)

Mutating Columns with paste0

I'm looking to dynamically name columns. I need to duplicate variables with new names. Why isn't the new_sepal_length_2 variable the same as new_sepal_length? How can I fix this?
new_var = 'Sepal.Length'
iris %>% mutate(new_sepal_length = Sepal.Length,
new_sepal_length_2 = noquote(paste0(new_var)))
We can convert the string to a symbol (sym) and unquote it (!!)
library(dplyr)
library(stringr)
iris %>%
mutate(new_sepal_length = str_c(!!rlang::sym(new_var), collapse=", "))
Or another option is to make use of mutate_at which can take strings in vars
iris %>%
mutate_at(vars(new_var), list(new= ~ str_c(., collapse=", ")))
Or use paste
iris %>%
mutate(new_sepal_length = paste(!!rlang::sym(new_var), collapse = ", "))
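paste0 or paste by itself only returns the character string "Sepal.Length", which mutate then recycles to every row; it does not look up the column of that name. To get the column's values, the string first has to be converted to a symbol, as above.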
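If the goal is simply a copy of the column whose name is stored in new_var, one option is dplyr's .data pronoun, which looks a column up by a string. This is a minimal sketch rather than the answerer's method; the column names follow the question.

```r
library(dplyr)

new_var <- "Sepal.Length"

# .data[[new_var]] looks up the column whose name is stored in new_var,
# instead of treating the string itself as the value.
out <- iris %>%
  mutate(
    new_sepal_length   = Sepal.Length,
    new_sepal_length_2 = .data[[new_var]]
  )

identical(out$new_sepal_length, out$new_sepal_length_2)
```

Unlike the str_c and paste variants with collapse, this keeps one numeric value per row.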
