Specify variable names when grouping - R

I am using dplyr v1.0.2 to manipulate tibbles. I would like to use group_by(), using a function or a regular expression to specify the relevant variable names (the ... argument). The only solution that I've found is clunky. Is there a relatively simple way?
Here is a minimal example that demonstrates the problem:
library(dplyr)
data(iris)
iris[, -(rbinom(1, 1, .5) + 1) ] %>% # randomly drop "Sepal.Length" or "Sepal.Width"
group_by(matches("^Sepal\\."))
In the third line, I randomly drop one of the two "Sepal" columns. In the last line, I want to group by the remaining "Sepal" column. The problem is that I don't know its name: it could be either "Sepal.Length" or "Sepal.Width." And the group_by() command in the last line doesn't work: it predictably returns a matches() must be used within a *selecting* function error message.
By contrast, this code works, but it is a bit clunky:
iris[, -(rbinom(1, 1, .5) + 1) ] %>%
group_by(!!as.name(grep('Sepal', colnames(.), value = TRUE)))
Is there a simpler way to do the grouping on the second line?

What about using across() to select the columns?
iris[, -(rbinom(1, 1, .5) + 1) ] %>%
group_by(across(starts_with('Sepal')))
# A tibble: 150 x 4
# Groups: Sepal.Length [35]
Sepal.Length Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <fct>
1 5.1 1.4 0.2 setosa
2 4.9 1.4 0.2 setosa
3 4.7 1.3 0.2 setosa
4 4.6 1.5 0.2 setosa
5 5 1.4 0.2 setosa
6 5.4 1.7 0.4 setosa
7 4.6 1.4 0.3 setosa
8 5 1.5 0.2 setosa
9 4.4 1.4 0.2 setosa
10 4.9 1.5 0.1 setosa
# … with 140 more rows
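If you want to keep the regular expression from the question, matches() should also work inside across(), because across() supplies the selecting context that group_by() lacks on its own. A minimal sketch, assuming dplyr 1.0.0 or later; set.seed() and the names drop_idx, dat and grouped are only there to make the illustration reproducible:

```r
library(dplyr)

# Drop one of the two Sepal columns at random, as in the question.
set.seed(1)  # assumption: seed added only for reproducibility
drop_idx <- rbinom(1, 1, 0.5) + 1
dat <- iris[, -drop_idx]

# across() is a selecting context, so tidyselect helpers such as
# matches() are allowed inside it.
grouped <- dat %>%
  group_by(across(matches("^Sepal\\.")))

# Name of the single remaining Sepal column that was used for grouping.
group_vars(grouped)
```

Whichever Sepal column survives, group_vars() reports it, so the rest of the pipeline does not need to know the name in advance.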

Related

R - tidyverse approach to split a dataframe by columns and keep a set of common columns

My question is very similar to this one, but I would prefer to have a tidyverse approach.
I have a dataset with several columns and I want to split it columnwise (not rowwise!), but keep a list of common columns in every dataset. To illustrate this, I will use the iris dataset, and let's say that Species is the common column that I want to keep.
It would be really easy to do it using just these simple operations:
iris1 <- iris[,c("Species", "Sepal.Width")]
iris2 <- iris[,c("Species", "Sepal.Length")]
iris3 <- iris[,c("Species", "Petal.Width")]
iris4 <- iris[,c("Species", "Petal.Length")]
So I want to achieve the same output as that, but in a tidyverse style and usable in a pipeline without breaking it.
One approach could be to make a function that extracts from iris the Species and the column (number or name) of your choice, then map those column numbers into your function.
library(dplyr)
library(purrr) # provides map(), which is called without the purrr:: prefix below
make_df <- function(col) { iris %>% select(Species, {{ col }} )}
c(2,1,4,3) %>% purrr::map(make_df)
or as one line:
c(2,1,4,3) %>% map(~iris %>% select(Species, {{ .x }}))
This will output a list with four elements, each of which is a data frame like you describe. For many workflows that will be safer and more convenient than creating four free-floating data frames in the global environment.
c(2,1,4,3) %>% map(make_df) %>% map(head)
[[1]]
Species Sepal.Width
1 setosa 3.5
2 setosa 3.0
3 setosa 3.2
4 setosa 3.1
5 setosa 3.6
6 setosa 3.9
[[2]]
Species Sepal.Length
1 setosa 5.1
2 setosa 4.9
3 setosa 4.7
4 setosa 4.6
5 setosa 5.0
6 setosa 5.4
[[3]]
Species Petal.Width
1 setosa 0.2
2 setosa 0.2
3 setosa 0.2
4 setosa 0.2
5 setosa 0.2
6 setosa 0.4
[[4]]
Species Petal.Length
1 setosa 1.4
2 setosa 1.4
3 setosa 1.3
4 setosa 1.5
5 setosa 1.4
6 setosa 1.7
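A variation on the same idea, in case column names are more convenient than positions: mapping over a named character vector gives a named list, so each piece can be pulled out by name. This is only a sketch; value_cols and iris_parts are names made up for the example.

```r
library(dplyr)
library(purrr)

# Every column except the common one.
value_cols <- setdiff(names(iris), "Species")

# set_names() names the vector by its own values, and map() carries
# those names over to the resulting list.
iris_parts <- value_cols %>%
  set_names() %>%
  map(~ select(iris, Species, all_of(.x)))

names(iris_parts)
head(iris_parts$Sepal.Width)
```

all_of() makes the selection fail loudly if a name is missing, instead of silently selecting something else.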

Extracting the observations of two species from the iris dataset in R [duplicate]

The iris dataset contains 150 observations of three plant species (setosa, versicolor and virginica), with 50 observations of each species. I would like to create a new data frame, called "a", containing only the observations of two of these species (setosa and versicolor). I have been trying to use the code below to do this, but it apparently does some sort of cycling; that is, instead of returning observations 1 to 100 (which is what I'd like), it returns observations 1, 3, 5, 7, ..., 100.
data(iris)
a <- subset(iris, Species == c("setosa", "versicolor"))
or
a <- iris[iris$Species == c("setosa","versicolor"),]
I would be grateful if anyone can help me figure out what I am doing wrong. I am aware that there are much simpler ways to get the data frame I want (e.g., the code listed below), but I would really like to understand why the code above does not work. I want to apply this to more complex datasets where I have to extract many species, and I would like to extract them by name.
a <- iris[1:100,] # this returns the dataframe I want
# or
a <- subset(iris, Species != "virginica") # this returns the dataframe I want as well
library(tidyverse)
iris %>%
filter(Species %in% c("setosa", "versicolor"))
# A tibble: 100 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 90 more rows
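As for why the original attempts return every other row: == compares element by element, and R recycles the shorter vector to the length of the longer one. So row 1 is compared with "setosa", row 2 with "versicolor", row 3 with "setosa" again, and so on. %in% instead asks, for each row, whether its value appears anywhere in the set. A small base R sketch of the difference (recycled, a_wrong and a_right are just illustrative names):

```r
# What == effectively compares against: the two names repeated to 150 values.
recycled <- rep(c("setosa", "versicolor"), length.out = nrow(iris))
head(recycled)

# Element-wise comparison with recycling: only rows whose species happens
# to line up with the alternating pattern are kept.
a_wrong <- iris[iris$Species == c("setosa", "versicolor"), ]

# Set membership: every setosa and versicolor row is kept.
a_right <- iris[iris$Species %in% c("setosa", "versicolor"), ]

nrow(a_wrong)  # 50
nrow(a_right)  # 100
```

No warning is raised in the == case because 150 is a multiple of 2, which is part of what makes this mistake easy to miss.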

R FuzzyJoin by clause with variables

I'm trying to adapt the inner join feature of the fuzzyjoin library.
The code:
JoinedRecs <- DataToUse1 %>%
stringdist_inner_join(DataToUse2, by = c(Full.Name1 = "Full.Name2"), max_dist = 2)
seems to work when I hard-code the variables in the "by = " clause.
However, I want to use variables, where:
Column1 <- "Full.Name1"
Column2 <- "Full.Name2"
I've tried a number of variations on possible syntax, but I always get the same error message:
Error: Must group by variables found in .data.
Column col is not found.
If someone could inform me what the right code is for "by = " clause using variables rather than hard-coding the names, I would be ever-so grateful.
Thanks!
We can use setNames to create the named vector for by:
library(fuzzyjoin)
JoinedRecs <- DataToUse1 %>%
stringdist_inner_join(DataToUse2,
by = setNames(Column2, Column1), max_dist = 2)
A reproducible example:
> iris2 <- data.frame(Species2 = 'setosa', value = 1)
> Column1 <- 'Species'
> Column2 <- 'Species2'
> stringdist_inner_join(head(iris), iris2,
by = setNames(Column2, Column1), max_dist = 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2 value
1 5.1 3.5 1.4 0.2 setosa setosa 1
2 4.9 3.0 1.4 0.2 setosa setosa 1
3 4.7 3.2 1.3 0.2 setosa setosa 1
4 4.6 3.1 1.5 0.2 setosa setosa 1
5 5.0 3.6 1.4 0.2 setosa setosa 1
6 5.4 3.9 1.7 0.4 setosa setosa 1
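To see why this works, it may help to look at what setNames() returns on its own: it should be the same named character vector as the hard-coded by argument from the question. A base R check, using the column names from the question (by_vec is a name chosen for the example):

```r
Column1 <- "Full.Name1"
Column2 <- "Full.Name2"

# The value is the column in the second table, the name is the column
# in the first table.
by_vec <- setNames(Column2, Column1)
by_vec

# Same object as the hard-coded version from the question.
identical(by_vec, c(Full.Name1 = "Full.Name2"))
```

Writing c(Column1 = Column2) would not work, because the name on the left of = is taken literally as "Column1" rather than evaluated.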

Use column names from vector in for loop in dplyr

This should probably be quite straightforward, but I am struggling to get it to work. I currently have a vector of column names:
columns <- c('product1', 'product2', 'product3', 'support4')
I now want to use dplyr in a for loop to mutate some columns, but I am struggling to make it recognize that it is a column name, not a variable.
for (col in columns) {
cross.sell.val <- cross.sell.val %>%
dplyr::mutate(col = ifelse(col == 6, 6, col)) %>%
dplyr::mutate(col = ifelse(col == 5, 6, col))
}
Can I use %>% in these situations? Thanks..
You should be able to do this without using a for loop at all.
Because you didn't provide any data, I am going to use the builtin iris dataset. The top of it looks like:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
First, I am saving the columns to analyze:
columns <- names(iris)[1:4]
Then, use mutate_at for each column, along with that particular rule. In each, the . represents the vector for each column. Your example implies that the rules are the same for each column, though if that is not the case, you may need more flexibility here.
mod_iris <-
iris %>%
mutate_at(columns, ~ ifelse(. > 5, 6, .)) %>% # formula lambda; funs() is deprecated since dplyr 0.8.0
mutate_at(columns, ~ ifelse(. < 1, 1, .))
returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.0 3.5 1.4 1 setosa
2 4.9 3.0 1.4 1 setosa
3 4.7 3.2 1.3 1 setosa
4 4.6 3.1 1.5 1 setosa
5 5.0 3.6 1.4 1 setosa
6 6.0 3.9 1.7 1 setosa
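On dplyr 1.0.0 or later, the scoped verbs such as mutate_at() are superseded by across(), so the same two rules could be sketched as follows. The name mod_iris_across is only used here to keep the result separate:

```r
library(dplyr)

columns <- names(iris)[1:4]

mod_iris_across <- iris %>%
  # all_of() tells dplyr that `columns` is a character vector of names
  mutate(across(all_of(columns), ~ ifelse(.x > 5, 6, .x))) %>%
  mutate(across(all_of(columns), ~ ifelse(.x < 1, 1, .x)))

head(mod_iris_across)
```

The two mutate() calls are kept separate so that the second rule sees the result of the first, as in the mutate_at() version.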
If you wanted to, you could instead write a function to make all of your changes for the column. This could also allow you to set the cutoffs differently for each column. For example, you may want to set the bottom and top portions of the data to be equal to that threshold (to rein in outliers for some reason), or you may know that each variable uses a dummy value as a placeholder (and that value is different by column, but is always the most common value). You could easily add in any arbitrary rule of interest this way, and it gives you a bit more flexibility than chaining together separate rules (e.g., if you use the mean, the mean changes when you change some of the values).
An example function:
modColumns <- function(x){
botThresh <- quantile(x, 0.25)
topThresh <- quantile(x, 0.75)
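# sort() is ascending, so [1] picks the least frequent value;
# use sort(table(x), decreasing = TRUE) if the most frequent value is wanted
dummyVal <- as.numeric(names(sort(table(x)))[1])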
dummyReplace <- NA
x <- ifelse(x < botThresh, botThresh, x)
x <- ifelse(x > topThresh, topThresh, x)
x <- ifelse(x == dummyVal, dummyReplace, x)
return(x)
}
And in use:
iris %>%
mutate_at(columns, modColumns) %>%
head
returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.3 1.6 0.3 setosa
2 5.1 3.0 1.6 0.3 setosa
3 5.1 3.2 1.6 0.3 setosa
4 5.1 3.1 1.6 0.3 setosa
5 5.1 3.3 1.6 0.3 setosa
6 5.4 3.3 1.7 0.4 setosa
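If you do want to keep the for loop from the question, the missing piece is telling dplyr that col holds a column name. One way, sketched here on iris rather than on the original data, is to read the column with .data[[col]] and to name the result with !!col :=. The name looped is only for illustration:

```r
library(dplyr)

columns <- names(iris)[1:4]
looped <- iris

for (col in columns) {
  looped <- looped %>%
    # .data[[col]] reads the column named by the string in `col`;
    # !!col := writes the result back under that same name
    mutate(!!col := ifelse(.data[[col]] > 5, 6, .data[[col]]))
}

head(looped)
```

Without these, mutate(col = ...) creates a new column literally called col, and col == 6 compares the string itself rather than the column it names.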

tidyr::spread() function throws an error

I am trying to use the gather and spread functions from the tidyverse, but the spread function throws an error.
library(caret)
library(dplyr) # provides %>%
dataset<-iris
# gather function is to convert wide data to long data
dataset_gather<-dataset %>% tidyr::gather(key=Type,value = Values,1:4)
head(dataset_gather)
# spread is the opposite of gather
The code below throws an error like this: Error: Duplicate identifiers for rows
dataset_spread <- dataset_gather %>% tidyr::spread(key = Type, value = Values)
Added later: Sorry @alistaire, I only saw your comment on the original post after posting this response.
As far as I understand Error: Duplicate identifiers for rows..., it occurs when you have values with the same identifier. For example, in the original 'iris' dataset, the first five rows of Species = setosa all have a Petal.Width of 0.2, and three of them have a Petal.Length of 1.4. Gathering those data isn't an issue, but when you try to spread them, the function doesn't know what belongs to what. That is, which 0.2 Petal.Width and 1.4 Petal.Length belong to which row of setosa.
The (tidyverse) solution I use in those circumstances is to create a unique marker for each row of data at the gather stage, so that the function can keep track of which duplicate data belong to which rows when you want to spread again. See the example below:
# Load packages
library(dplyr)
library(tidyr)
# Get data
dataset <- iris
# View dataset
head(dataset)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# Gather data
dataset_gathered <- dataset %>%
# Create a unique identifier for each row
mutate(marker = row_number(Species)) %>%
# Gather the data
gather(key = Type, value = Values, 1:4)
# View gathered data
head(dataset_gathered)
#> Species marker Type Values
#> 1 setosa 1 Sepal.Length 5.1
#> 2 setosa 2 Sepal.Length 4.9
#> 3 setosa 3 Sepal.Length 4.7
#> 4 setosa 4 Sepal.Length 4.6
#> 5 setosa 5 Sepal.Length 5.0
#> 6 setosa 6 Sepal.Length 5.4
# Spread it out again
dataset_spread <- dataset_gathered %>%
# Group the data by the marker
group_by(marker) %>%
# Spread it out again
spread(key = Type, value = Values) %>%
# Not essential, but remove marker
ungroup() %>%
select(-marker)
# View spread data
head(dataset_spread)
#> # A tibble: 6 x 5
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#> <fctr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 1.4 0.2 5.1 3.5
#> 2 setosa 1.4 0.2 4.9 3.0
#> 3 setosa 1.3 0.2 4.7 3.2
#> 4 setosa 1.5 0.2 4.6 3.1
#> 5 setosa 1.4 0.2 5.0 3.6
#> 6 setosa 1.7 0.4 5.4 3.9
(and as ever, thanks to Jenny Bryan for the reprex package)
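In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The same row-marker idea carries over; here is a rough equivalent of the round trip above, where long and wide are names chosen for this sketch:

```r
library(dplyr)
library(tidyr)

long <- iris %>%
  # unique identifier for each original row
  mutate(marker = row_number()) %>%
  pivot_longer(
    cols = -c(Species, marker),
    names_to = "Type",
    values_to = "Values"
  )

wide <- long %>%
  # marker + Species together identify each output row
  pivot_wider(names_from = Type, values_from = Values) %>%
  select(-marker)

head(wide)
```

Without the marker column, pivot_wider() cannot tell the rows apart either. It warns about values that are not uniquely identified and returns list-columns instead of the original numbers.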
We can also do this with data.table:
library(data.table)
dcast(melt(setDT(dataset, keep.rownames = TRUE), id.vars = c("rn", "Species")), rn + Species ~ variable)
