use variable in str_detect's first argument - r

I want to use the str_detectfunction passing a variable as the first argument. Meaning this could theoretically look something like this.
# create the variable
var = names(mtcars)[1]
mtcars %>%
mutate(
new_var = case_when(str_detect(var, "^2"), "two", "other")
)
Now I'm not sure how to insert the variable var correctly into the str_detect function. I guess some tidy-eval is necessary, but I'm not sure....

using mtcars as an exmaple for string manipulation is not very helpful, so switching over to iris. Also, your case_when specification was wrong, so I'm using if_else for this example.
You can use !!(sym(var))
library(tidyverse)
var <- "Species"
iris %>%
mutate(
new_var = if_else(str_detect(!!sym(var), "set"), "two", "other")
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
1 5.1 3.5 1.4 0.2 setosa two
2 4.9 3.0 1.4 0.2 setosa two
3 4.7 3.2 1.3 0.2 setosa two
4 4.6 3.1 1.5 0.2 setosa two
5 5.0 3.6 1.4 0.2 setosa two
6 5.4 3.9 1.7 0.4 setosa two

Related

R FuzzyJoin by clause with variables

I'm trying to adapt the inner join feature of the fuzzyjoin library.
The code:
JoinedRecs <- DataToUse1 %>%
stringdist_inner_join(DataToUse2, by = c(Full.Name1 = "Full.Name2"), max_dist = 2)
seems to work when I hard-code the variables in the "by = " clause.
However, I want to use variables, where:
Column1 <- "Full.Name1"
Column2 <- "Full.Name2"
I've tried a number of variations on possible syntax, but I always get the same error message:
Error: Must group by variables found in .data.
Column col is not found.
If someone could inform me what the right code is for "by = " clause using variables rather than hard-coding the names, I would be ever-so grateful.
Thanks!
We can use setNames to create a named vector in by
library(fuzzyjoin)
JoinedRecs <- DataToUse1 %>%
stringdist_inner_join(DataToUse2,
by = setNames(Column2, Column1), max_dist = 2)
-reproducible example
> iris2 <- data.frame(Species2 = 'setosa', value = 1)
> Column1 <- 'Species'
> Column2 <- 'Species2'
> stringdist_inner_join(head(iris), iris2,
by = setNames(Column2, Column1), max_dist = 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2 value
1 5.1 3.5 1.4 0.2 setosa setosa 1
2 4.9 3.0 1.4 0.2 setosa setosa 1
3 4.7 3.2 1.3 0.2 setosa setosa 1
4 4.6 3.1 1.5 0.2 setosa setosa 1
5 5.0 3.6 1.4 0.2 setosa setosa 1
6 5.4 3.9 1.7 0.4 setosa setosa 1

Operating on list of strings representing column names?

I'm currently trying to automate a data task that requires taking in a list of column names in string format, then summing those columns (rowwise). i.e., suppose there is some list as follows:
> list
[1] "colname1" "colname2" "colname3"
How would I go about passing in this list to some function like sum() in tidyverse? That is, I would like to run something like the following:
df <- df %>%
rowwise %>%
mutate(new_var = sum(list))
Any suggestions would be greatly, greatly appreciated. Thanks.
You could use rowSums here. For example:
library(dplyr)
mycols <- colnames(iris)[3:4]
mycols
[1] "Petal.Length" "Petal.Width"
Then:
iris %>%
mutate(new_var = rowSums(.[, mycols])) %>%
head()
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.6
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.7
5 5.0 3.6 1.4 0.2 setosa 1.6
6 5.4 3.9 1.7 0.4 setosa 2.1
You can pass the vector of column names in c_across.
library(dplyr)
df <- df %>% rowwise() %>% mutate(new_var = sum(c_across(list)))
df

specify variable names when grouping

I am using dplyr v1.0.2 to manipulate tibbles. I would like to use group_by(), using a function or a regular expression to specify the relevant variable names (the ... argument). The only solution that I've found is clunky. Is there a relatively simple way?
Here is a minimal example that demonstrates the problem:
library(dplyr)
data(iris)
iris[, -(rbinom(1, 1, .5) + 1) ] %>% # randomly drop "Sepal.Length" or "Sepal.Width"
group_by(matches("^Sepal\\."))
In the third line, I randomly drop one of the two "Sepal" columns. In the last line, I want to group by the remaining "Sepal" column. The problem is that I don't know its name: it could be either "Sepal.Length" or "Sepal.Width." And the group_by() command in the last line doesn't work: it predictably returns a matches() must be used within a *selecting* function error message.
By contrast, this code works, but it is a bit clunky:
iris[, -(rbinom(1, 1, .5) + 1) ] %>%
group_by(!!as.name(grep('Sepal', colnames(.), val = TRUE)))
Is there a simpler way to do the grouping on the second line?
What about using across to select the columns
iris[, -(rbinom(1, 1, .5) + 1) ] %>%
group_by(across(starts_with('Sepal')))
# A tibble: 150 x 4
# Groups: Sepal.Length [35]
Sepal.Length Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <fct>
1 5.1 1.4 0.2 setosa
2 4.9 1.4 0.2 setosa
3 4.7 1.3 0.2 setosa
4 4.6 1.5 0.2 setosa
5 5 1.4 0.2 setosa
6 5.4 1.7 0.4 setosa
7 4.6 1.4 0.3 setosa
8 5 1.5 0.2 setosa
9 4.4 1.4 0.2 setosa
10 4.9 1.5 0.1 setosa
# … with 140 more rows

select first occurrence of variable with prefix in dataframe

What is the best way to dplyr::select the first occurrence of a variable with a certain prefix (and all other variables without that prefix). Or put another way, drop all variables with that prefix except the first occurrence.
library(tidyverse)
hiris <- head(iris)
#given this data.frame:
lst(hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
# rowname Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
# 2 2 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2 setosa
# 5 5 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa 5.4 3.9 1.7 0.4 setosa 5.4 3.9 1.7 0.4 setosa
Now lets say I want to drop all variables with prefix Sepal.Length except the first one (Sepal.Length.x) I could do:
lst(hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname") %>%
dplyr::select(-Sepal.Length.y, -Sepal.Length)
which works fine but I want something flexible so it will work with an arbitrary number of variables with prefix Sepal.Length e.g.:
lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
I could do something like this:
df <- lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
name_drop <- (df %>% select(matches("Sepal.Length")) %>% names())[-1]
df %>%
select(-name_drop)
but im looking to do it in a pipe and more efficiently. any suggestions?
thanks
I like this explanation of the problem:
drop all variables with that prefix except the first occurrence.
select(iris, !starts_with("Sepal")[-1])
# Sepal.Length Petal.Length Petal.Width Species
# 1 5.1 1.4 0.2 setosa
# 2 4.9 1.4 0.2 setosa
# ...
starts_with("Sepal") of course returns all columns that start with "Sepal", we can use [-1] to remove the first match, and ! to drop any remaining matches.
It does seem a little like black magic - if we were doing this in base R, the [-1] would be appropriate if we used which() to get column indices, and the ! would be appropriate if we didn't use which() and had a logical vector, but somehow the tidyselect functionality makes it work!

Use column names from vector in for loop in dplyr

this should probably be quite straightforward, but I am struggling to get it to work. I currently have a vector of column names:
columns <- c('product1', 'product2', 'product3', 'support4')
I now want to use dplyr in a for loop to mutate some columns, but I am struggling to make it recognize that it is a column name, not a variable.
for (col in columns) {
cross.sell.val <- cross.sell.val %>%
dplyr::mutate(col = ifelse(col == 6, 6, col)) %>%
dplyr::mutate(col = ifelse(col == 5, 6, col))
}
Can I use %>% in these situations? Thanks..
You should be able to do this without using a for loop at all.
Because you didn't provide any data, I am going to use the builtin iris dataset. The top of it looks like:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
First, I am saving the columns to analyze:
columns <- names(iris)[1:4]
Then, use mutate_at for each column, along with that particular rule. In each, the . represents the vector for each column. Your example implies that the rules are the same for each column, though if that is not the case, you may need more flexibility here.
mod_iris <-
iris %>%
mutate_at(columns, funs(ifelse(. > 5, 6, .))) %>%
mutate_at(columns, funs(ifelse(. < 1, 1, .)))
returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.0 3.5 1.4 1 setosa
2 4.9 3.0 1.4 1 setosa
3 4.7 3.2 1.3 1 setosa
4 4.6 3.1 1.5 1 setosa
5 5.0 3.6 1.4 1 setosa
6 6.0 3.9 1.7 1 setosa
If you wanted to, you could instead write a function to make all of your changes for the column. This could also allow you to set the cutoffs differently for each column. For example, you may want to set the bottom and top portions of the data to be equal to that threshold (to reign in outliers for some reason), or you may know that each variable uses a dummy value as a placeholder (and that value is different by column, but is always the most common value). You could easily add in any arbitrary rule of interest this way, and it gives you a bit more flexibility than chaining together separate rules (e.g., if you use the mean, the mean changes when you change some of the values).
An example function:
modColumns <- function(x){
botThresh <- quantile(x, 0.25)
topThresh <- quantile(x, 0.75)
dummyVal <- as.numeric(names(sort(table(x)))[1])
dummyReplace <- NA
x <- ifelse(x < botThresh, botThresh, x)
x <- ifelse(x > topThresh, topThresh, x)
x <- ifelse(x == dummyVal, dummyReplace, x)
return(x)
}
And in use:
iris %>%
mutate_at(columns, modColumns) %>%
head
returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.3 1.6 0.3 setosa
2 5.1 3.0 1.6 0.3 setosa
3 5.1 3.2 1.6 0.3 setosa
4 5.1 3.1 1.6 0.3 setosa
5 5.1 3.3 1.6 0.3 setosa
6 5.4 3.3 1.7 0.4 setosa

Resources