Succinct subsetting across multiple columns in R - r

Say I have a massive dataframe and in multiple columns I have an extremely large list of unique codes and I want to use these codes to select certain rows to subset the original dataframe. There are around 1000 codes and the codes I want all follow after each other. For example I have about 30 columns that contain codes and I only want to take rows that have codes 100 to 120 in ANY of these columns .
There's a long way to do this which is something like
new_dat <- df[which(df$codes==100 | df$codes==101 | df$codes1==100
and I repeat this for every single possible code for everyone of the columns that can contain these codes. Is there a way to do this in a more convenient fashion?
I want to try solving this with dplyr's select function, but I'm having trouble seeing if it works for my case out of the box
Take the iris dataset
Say I wanted all rows that contain the value 4.0-5.0 in any columns that contains the word Sepal in the column name.
#this only goes for 4.0
brand_new_df <- select(filter(iris, Sepal.Length ==4.0 | Sepal.Width == 4.0))
but what I want is something like
brand_new_df <- select(filter(iris, contains(Sepal) == 4.0:5.0))
Is there a dplyr way to do this?

A corresponding across() version from #RonakShah's answer:
library(dplyr)
iris %>% filter(rowSums(across(contains('Sepal'), ~ between(., 4, 5))) > 0)
or
iris %>% filter(rowSums(across(contains('Sepal'), between, 4, 5)) > 0)
From vignette("colwise"):
Previously, filter() was paired with the all_vars() and any_vars() helpers. Now, across() is equivalent to all_vars(), and there’s no direct replacement for any_vars().
So you need something like rowSums(...) > 0 to achieve the effect of any_vars().

You can use filter_at :
library(dplyr)
iris %>% filter_at(vars(contains('Sepal')), any_vars(between(., 4, 5)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#3 4.6 3.1 1.5 0.2 setosa
#4 5.0 3.6 1.4 0.2 setosa
#5 4.6 3.4 1.4 0.3 setosa
#6 5.0 3.4 1.5 0.2 setosa
#7 4.4 2.9 1.4 0.2 setosa
#....

Base R:
# Subset:
cols <- grep("codes", names(df2), value = TRUE)
df2[rowSums(sapply(cols,
function(x) {
df2[, x] >= 100 & df2[, x] <= 120
})) == length(cols), ]
# Data:
tmp <- data.frame(x1 <- rnorm(999, mean = 100, sd = 2))
df <-
setNames(data.frame(tmp[rep(1, each = 80)]), paste0("codes", 1:80))
df2 <- cbind(id = 1:nrow(df), df)

One option could be:
iris %>%
filter(Reduce(`|`, across(contains("Sepal"), ~ between(.x, 4, 5))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 4.6 3.4 1.4 0.3 1
6 5.0 3.4 1.5 0.2 1
7 4.4 2.9 1.4 0.2 1
8 4.9 3.1 1.5 0.1 1
9 4.8 3.4 1.6 0.2 1
10 4.8 3.0 1.4 0.1 1

library(dplyr)
df <- iris
# value to look for
val <- 4
# find columns
cols <- which(colSums(df == val , na.rm = TRUE) > 0L)
# filter rows
iris %>% filter_at(cols, any_vars(.==val))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.5 2.3 4.0 1.3 versicolor
3 6.0 2.2 4.0 1.0 versicolor
4 6.1 2.8 4.0 1.3 versicolor
5 5.5 2.5 4.0 1.3 versicolor
6 5.8 2.6 4.0 1.2 versicolor

Related

R FuzzyJoin by clause with variables

I'm trying to adapt the inner join feature of the fuzzyjoin library.
The code:
JoinedRecs <- DataToUse1 %>%
stringdist_inner_join(DataToUse2, by = c(Full.Name1 = "Full.Name2"), max_dist = 2)
seems to work when I hard-code the variables in the "by = " clause.
However, I want to use variables, where:
Column1 <- "Full.Name1"
Column2 <- "Full.Name2"
I've tried a number of variations on possible syntax, but I always get the same error message:
Error: Must group by variables found in .data.
Column col is not found.
If someone could inform me what the right code is for "by = " clause using variables rather than hard-coding the names, I would be ever-so grateful.
Thanks!
We can use setNames to create a named vector in by
library(fuzzyjoin)
JoinedRecs <- DataToUse1 %>%
stringdist_inner_join(DataToUse2,
by = setNames(Column2, Column1), max_dist = 2)
-reproducible example
> iris2 <- data.frame(Species2 = 'setosa', value = 1)
> Column1 <- 'Species'
> Column2 <- 'Species2'
> stringdist_inner_join(head(iris), iris2,
by = setNames(Column2, Column1), max_dist = 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2 value
1 5.1 3.5 1.4 0.2 setosa setosa 1
2 4.9 3.0 1.4 0.2 setosa setosa 1
3 4.7 3.2 1.3 0.2 setosa setosa 1
4 4.6 3.1 1.5 0.2 setosa setosa 1
5 5.0 3.6 1.4 0.2 setosa setosa 1
6 5.4 3.9 1.7 0.4 setosa setosa 1

Select values in a table conditional to an external table

I'd like to select the first N values of each variables (columns) in a data set, where N varies by column and row and are given in an other table. An example below with the iris data:
data(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
## Create a fake external table
ext.tab <- data.table(species=c("setosa","versicolor", "virginica" ),N1=c(1:3),N2=c(3:5),N3=c(5:7),N4=c(7:9))
head(ext.tab)
species N1 N2 N3 N4
1: setosa 1 3 5 7
2: versicolor 2 4 6 8
3: virginica 3 5 7 9
Now for Iris setosa, I'd like to get the first maximum value (N1 in ext.tab) of column 1 ('sepal.length' in iris data), then the three max values (N2 in ext.tab) for column 2 (sepal.width), then the five max values (N3) for column 3 (petal.length) and so forth. Then moving to the Iris versicolor and do the same.
The result can be either a table or a list for each species with the values themselves or row indices for each variable (column). Any idea of a fast way to implement that?
Here is a tidyverse approach using a custom function. The function takes the variable and group names as character scalar and number of maximum values as numeric. Inside the function is a dplyr pipeline using .data pronoun. Then, I reshaped ext.tab to long form and applied get_maximum() row-wise.
library(tidyverse)
get_maximum <- \(.x, .group, .n_max, .dat) {
.dat %>%
filter(Species == .group) %>%
arrange(desc(.data[[.x]])) %>%
slice(seq_len(.n_max)) %>%
pull(.data[[.x]])
}
dat <- as_tibble(ext.tab) %>%
pivot_longer(-species) %>%
mutate(name = recode(
name,
N1 = "Sepal.Length",
N2 = "Sepal.Width",
N3 = "Petal.Length",
N4 = "Petal.Width"
)) %>%
rowwise() %>%
mutate(max_num = list(
get_maximum(name, species, value, iris)
)) %>%
ungroup()
If you need the unique maximum values, you can add distinct() inside the custom function.
get_maximum_unique <- \(.x, .group, .n_max, .dat) {
.dat %>%
filter(Species == .group) %>%
distinct(.data[[.x]]) %>%
arrange(desc(.data[[.x]])) %>%
slice(seq_len(.n_max)) %>%
pull(.data[[.x]])
}
Here is an option using data.table. I have taken the liberty of renaming the column names.
cols <- setdiff(names(ext.tab), "Species")
iris[ext.tab, on=.(Species), by=.EACHI,
.(.(mapply(function(x, n) -head(sort(-x, partial=n), n),
x=mget(cols), n=mget(paste0("i.", cols)), SIMPLIFY=FALSE)))]$V1
data:
library(data.table)
iris <- as.data.table(iris)
ext.tab <- data.table(Species=c("setosa", "versicolor", "virginica"),
Sepal.Length=c(1:3),
Sepal.Width=c(3:5),
Petal.Length=c(5:7),
Petal.Width=c(7:9))
output:
[[1]]
[[1]]$Sepal.Length
[1] 5.8
[[1]]$Sepal.Width
[1] 4.4 4.2 4.1
[[1]]$Petal.Length
[1] 1.9 1.9 1.7 1.7 1.7
[[1]]$Petal.Width
[1] 0.4 0.4 0.6 0.4 0.5 0.4 0.4
[[2]]
[[2]]$Sepal.Length
[1] 7.0 6.9
[[2]]$Sepal.Width
[1] 3.4 3.3 3.2 3.2
[[2]]$Petal.Length
[1] 5.1 4.8 4.9 5.0 4.9 4.8
[[2]]$Petal.Width
[1] 1.7 1.6 1.6 1.8 1.5 1.5 1.6 1.5
[[3]]
[[3]]$Sepal.Length
[1] 7.7 7.9 7.7
[[3]]$Sepal.Width
[1] 3.8 3.8 3.6 3.4 3.4
[[3]]$Petal.Length
[1] 6.4 6.3 6.7 6.9 6.7 6.6 6.1
[[3]]$Petal.Width
[1] 2.5 2.5 2.4 2.5 2.4 2.4 2.3 2.3 2.3
Short explanation:
Perform a left join iris[ext.tab, on=.(Species),
by=.EACHI means for each row of ext.tab
x=mget(cols) gets the columns in iris
mget(paste0("i.", cols)) gets the number of values required for each column
-head(sort(-x, partial=n), n) performs a partial sort and extract the first n values
SIMPLIFY=FALSE and .(.( )) are simply required to return the results as a list

Creating a data frame based on a simple VLOOKUP in R?

df <- iris
x <- data.frame(Petal.Length=c('1.7', '1.9', '3.5'))
The new data frame (dfnew) needs all 5 columns from "iris" extracted, for all the rows with the petal lengths specified in "x".
I've tried it this way, but it doesn't seem to work:
dfnew <- df$Petal.Length[x]
Using dplyr:
> library(dplyr)
> data(iris)
> (dfnew <- iris %>% filter(Petal.Length %in% c('1.7', '1.9', '3.5')) )
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.4 3.9 1.7 0.4 setosa
2 5.7 3.8 1.7 0.3 setosa
3 5.4 3.4 1.7 0.2 setosa
4 5.1 3.3 1.7 0.5 setosa
5 4.8 3.4 1.9 0.2 setosa
6 5.1 3.8 1.9 0.4 setosa
7 5.0 2.0 3.5 1.0 versicolor
8 5.7 2.6 3.5 1.0 versicolor
It's worth noting that this is what you are technically asking for with "VLOOKUP", but the comment from phiver might actually be what you want.
df <- iris
x <- data.frame(Petal.Length=c('1.7', '1.9', '3.5'), X = c('X','Y','Z'))
df.new <- merge(df, x, by = 'Petal.Length')

dplyr::select_if can use colnames and their values at the same time?

I want to select cols using colnames and their values in a single pipe chain without referring other objects, such as NAMES <- names(d). Can I do it with select_if() ?
For example,
I can use colnames to select cols.
(select(matches(...)) is smarter treating only colnames).
library(dplyr)
d <- iris %>% select(-Species) %>% tibble::as.tibble()
d %>% select_if(stringr::str_detect(names(.), "Petal"))
And I can use the values.
d %>% select_if(~ mean(.) > 5)
But how to use both of them ? (especially OR)
Below code is what I want (of course, don't run).
d %>% select_if(stringr::str_detect(names(.), "Petal") | ~ mean(.) > 5)
Any help would be greatly appreciated.
A workaround that is not too complicated is:
d %>% select_if(stringr::str_detect(names(.), "Petal") | sapply(., mean) > 5)
# or
d %>% select_if(grepl("Petal",names(.)) | sapply(., mean) > 5)
Which gives:
# A tibble: 150 x 3
Sepal.Length Petal.Length Petal.Width
<dbl> <dbl> <dbl>
1 5.1 1.4 0.2
2 4.9 1.4 0.2
3 4.7 1.3 0.2
4 4.6 1.5 0.2
5 5.0 1.4 0.2
6 5.4 1.7 0.4
7 4.6 1.4 0.3
8 5.0 1.5 0.2
9 4.4 1.4 0.2
10 4.9 1.5 0.1
# ... with 140 more rows

Use column names from vector in for loop in dplyr

this should probably be quite straightforward, but I am struggling to get it to work. I currently have a vector of column names:
columns <- c('product1', 'product2', 'product3', 'support4')
I now want to use dplyr in a for loop to mutate some columns, but I am struggling to make it recognize that it is a column name, not a variable.
for (col in columns) {
cross.sell.val <- cross.sell.val %>%
dplyr::mutate(col = ifelse(col == 6, 6, col)) %>%
dplyr::mutate(col = ifelse(col == 5, 6, col))
}
Can I use %>% in these situations? Thanks..
You should be able to do this without using a for loop at all.
Because you didn't provide any data, I am going to use the builtin iris dataset. The top of it looks like:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
First, I am saving the columns to analyze:
columns <- names(iris)[1:4]
Then, use mutate_at for each column, along with that particular rule. In each, the . represents the vector for each column. Your example implies that the rules are the same for each column, though if that is not the case, you may need more flexibility here.
mod_iris <-
iris %>%
mutate_at(columns, funs(ifelse(. > 5, 6, .))) %>%
mutate_at(columns, funs(ifelse(. < 1, 1, .)))
returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.0 3.5 1.4 1 setosa
2 4.9 3.0 1.4 1 setosa
3 4.7 3.2 1.3 1 setosa
4 4.6 3.1 1.5 1 setosa
5 5.0 3.6 1.4 1 setosa
6 6.0 3.9 1.7 1 setosa
If you wanted to, you could instead write a function to make all of your changes for the column. This could also allow you to set the cutoffs differently for each column. For example, you may want to set the bottom and top portions of the data to be equal to that threshold (to reign in outliers for some reason), or you may know that each variable uses a dummy value as a placeholder (and that value is different by column, but is always the most common value). You could easily add in any arbitrary rule of interest this way, and it gives you a bit more flexibility than chaining together separate rules (e.g., if you use the mean, the mean changes when you change some of the values).
An example function:
modColumns <- function(x){
botThresh <- quantile(x, 0.25)
topThresh <- quantile(x, 0.75)
dummyVal <- as.numeric(names(sort(table(x)))[1])
dummyReplace <- NA
x <- ifelse(x < botThresh, botThresh, x)
x <- ifelse(x > topThresh, topThresh, x)
x <- ifelse(x == dummyVal, dummyReplace, x)
return(x)
}
And in use:
iris %>%
mutate_at(columns, modColumns) %>%
head
returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.3 1.6 0.3 setosa
2 5.1 3.0 1.6 0.3 setosa
3 5.1 3.2 1.6 0.3 setosa
4 5.1 3.1 1.6 0.3 setosa
5 5.1 3.3 1.6 0.3 setosa
6 5.4 3.3 1.7 0.4 setosa

Resources