Conditional non-equi join - r

library(tidyverse)
iris <- iris
means <- iris %>%
group_by(Species) %>%
summarise_all(funs(mean))
sd <- iris %>%
group_by(Species) %>%
summarise_all(funs(sd))
bottom <- means[ ,2:5] - sd[ ,2:5]
bottom$Species <- c("setosa", "versicolor", "virginica")
print(bottom)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.653510 3.048936 1.288336 0.1406144 setosa
2 5.419829 2.456202 3.790089 1.1282473 versicolor
3 5.952120 2.651503 5.000105 1.7513499 virginica
top <- means[ ,2:5] + sd[ ,2:5]
top$Species <- c("setosa", "versicolor", "virginica")
print(top)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.358490 3.807064 1.635664 0.3513856 setosa
2 6.452171 3.083798 4.729911 1.5237527 versicolor
3 7.223880 3.296497 6.103895 2.3006501 virginica
How do I get the rows of Iris where the values for Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width all fall between the values in the top and bottom data frames?
For example, I only want setosa rows where Sepal.Length > 4.65 & Sepal.Length < 5.35 and Sepal.Width is between 3.04 and 3.80, etc. Ideally the end result contains only the 4 numeric columns and the species column.
Thanks.

It would be much easier if you can filter from the beginning without the summarize step:
iris %>%
group_by(Species) %>%
filter_if(is.numeric, all_vars(. < mean(.) + sd(.) & . > mean(.) - sd(.)))
# A tibble: 54 x 5
# Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
# 3 5.0 3.6 1.4 0.2 setosa
# 4 5.0 3.4 1.5 0.2 setosa
# 5 4.8 3.4 1.6 0.2 setosa
# 6 5.1 3.5 1.4 0.3 setosa
# 7 5.1 3.8 1.5 0.3 setosa
# 8 5.2 3.5 1.5 0.2 setosa
# 9 5.2 3.4 1.4 0.2 setosa
#10 4.7 3.2 1.6 0.2 setosa
# ... with 44 more rows
Not sure if you can avoid the summarize step, post as an option here.
Or use between:
iris %>%
group_by(Species) %>%
filter_if(is.numeric, all_vars(between(., mean(.) - sd(.), mean(.) + sd(.))))

Here is a solution using non-equi joins which is building on the (now deleted) approach of #Frank:
library(data.table)
# add a row number column and to reshape from wide to long
DT <- melt(data.table(iris)[, rn := .I], id = c("rn", "Species"))
# compute lower and upper bound for each variable and Species
mDT <- DT[, .(lb = lb <- mean(value) - (s <- sd(value)),
ub = lb + 2 * s), by = .(Species, variable)]
# find row numbers of items which fulfill conditions
selected_rn <-
# non-equi join
DT[DT[mDT, on = .(Species, variable, value > lb, value < ub), which = TRUE]][
# all uniqueN(mDT$variable) variables must have been selected
# for an item to pass (thanks to #Frank for tip to avoid hardcoded value)
, .N, by = rn][N == uniqueN(mDT$variable), rn]
head(iris[sort(selected_rn),])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
8 5.0 3.4 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
18 5.1 3.5 1.4 0.3 setosa

Related

How to mutate a column with if/then in R Data.frame

Lets suppose if the data is
data <- head(iris)
How can i create a new column whose values will be derived from data$Sepal.Length in a way that if data$Sepal.Length is equal to or greater than 5, value will be 5 and if its less or equal to 3, value will be 3, else values should remain same...
I have tried
data %>% mutate(Sepal.Length = case_when(Sepal.Length <=3 ~ '3',Sepal.Length>=5 ~ '5'))
But it is giving NA to remaining values..
You can do this using a basic case_when statement:
data %>%
mutate(Sepal.Length = case_when(
Sepal.Length <= 3 ~ 3,
Sepal.Length >= 5 ~ 5,
TRUE ~ Sepal.Length))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5 3.6 1.4 0.2 setosa
# 6 5 3.9 1.7 0.4 setosa```

R dataframe - Top n values in row with column names

I want to sort rowwise values in specific columns, get top 'n' values, and get corresponding column names in new columns.
The output would look something like this:
SL SW PL PW Species high1 high2 high3 col1 col2 col3
dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 3.5 1.4 0.2 SW PL PW
2 4.9 3 1.4 0.2 setosa 3 1.4 0.2 SW PL PW
3 4.7 3.2 1.3 0.2 setosa 3.2 1.3 0.2 SW PL PW
Tried something like code below, but unable to get column names.
What I'm hoping to achieve is to compare the highest 'n' values (rows[n]) with values in dataframe for each row, and then extract corresponding column name of matching value. For eg. rows[1] == 3.5 (from column 'SW'). Is this feasible?
Help appreciated.
iris %>%
rowwise() %>%
mutate(rows = list(sort(c( Sepal.Width, Petal.Length, Petal.Width), decreasing = TRUE))) %>%
mutate(high1 = rows[1], col1 = names(~.)[which(~.[] ==rows[1]),
high2 = rows[2], col2 = names(~.)[which(~.[] ==rows[2]),
high3 = rows[3], col3 = names(~.)[which(~.[] ==rows[3])
) %>%
select(-rows)
You could pivot to long, group by the corresponding original row, use slice_max to get the top values, then pivot back to wide and bind that output to the original table.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
iris %>%
group_by(rn = row_number()) %>%
pivot_longer(-c(Species, rn), 'col', values_to = 'high') %>%
slice_max(col, n = 2) %>%
mutate(nm = row_number()) %>%
pivot_wider(values_from = c(high, col),
names_from = nm) %>%
ungroup() %>%
select(-c(Species, rn)) %>%
bind_cols(iris)
#> # A tibble: 150 × 9
#> high_1 high_2 col_1 col_2 Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 Sepal.… Sepa… 5.1 3.5 1.4 0.2
#> 2 4.9 3 Sepal.… Sepa… 4.9 3 1.4 0.2
#> 3 4.7 3.2 Sepal.… Sepa… 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 Sepal.… Sepa… 4.6 3.1 1.5 0.2
#> 5 5 3.6 Sepal.… Sepa… 5 3.6 1.4 0.2
#> 6 5.4 3.9 Sepal.… Sepa… 5.4 3.9 1.7 0.4
#> 7 4.6 3.4 Sepal.… Sepa… 4.6 3.4 1.4 0.3
#> 8 5 3.4 Sepal.… Sepa… 5 3.4 1.5 0.2
#> 9 4.4 2.9 Sepal.… Sepa… 4.4 2.9 1.4 0.2
#> 10 4.9 3.1 Sepal.… Sepa… 4.9 3.1 1.5 0.1
#> # … with 140 more rows, and 1 more variable: Species <fct>
Created on 2022-02-16 by the reprex package (v2.0.1)
Edited to remove the unnecessary rename and mutate, thanks to tip from #Onyambu!
My approach is to make a function that takes any dataframe (df), any set of columns that you want to focus on (cols), and any value for top n (n)
# load data.table and magrittr (I only use %>% for illustration here)
library(data.table)
library(magrittr)
# define function
get_high_vals_cols <- function(df, cols, n=3) {
setDT(df)[, `_rowid`:=.I]
df_l <- melt(df,id = "_rowid",measure.vars = cols, variable.name = "col",value.name = "high") %>%
.[order(-high), .SD[1:n], by="_rowid"] %>%
.[,id:=1:.N, by="_rowid"]
dcast(df_l, `_rowid`~id, value.var = list("col", "high"))[,`_rowid`:=NULL]
}
Then, you can feed any dataframe to this function, along with any columns of interest
cols= c("Sepal.Width", "Petal.Length", "Petal.Width")
get_high_vals_cols(iris,cols,3)
Output
col_1 col_2 col_3 high_1 high_2 high_3
1: Sepal.Width Petal.Length Petal.Width 3.5 1.4 0.2
2: Sepal.Width Petal.Length Petal.Width 3.0 1.4 0.2
3: Sepal.Width Petal.Length Petal.Width 3.2 1.3 0.2
4: Sepal.Width Petal.Length Petal.Width 3.1 1.5 0.2
5: Sepal.Width Petal.Length Petal.Width 3.6 1.4 0.2
---
146: Petal.Length Sepal.Width Petal.Width 5.2 3.0 2.3
147: Petal.Length Sepal.Width Petal.Width 5.0 2.5 1.9
148: Petal.Length Sepal.Width Petal.Width 5.2 3.0 2.0
149: Petal.Length Sepal.Width Petal.Width 5.4 3.4 2.3
150: Petal.Length Sepal.Width Petal.Width 5.1 3.0 1.8

which.max not functioning as expected

I am trying to create a table that includes the value of y for when x is equal to or less than a certain value, by group. Below is my code using the iris data set.
For "<=2.5", I expect to get 4.5, 5.0, or 5.8 for the virginica group, since these are the values of Petal.Length associated with a Sepal.Width of 2.5 for virginica. But instead, I get 6.0. Any ideas of where I went wrong? (My actual data set does not have duplicates of the variable analogous to Sepal.Width for the same group, so choosing among those is not an issue for me.)
data(iris)
my.table <- iris %>%
group_by(Species) %>%
summarise("<=2.5" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=2.5])],
"<=3" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=3])],
"<=3.5" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=3.5])],
"<=4" = Petal.Length[which.max(Sepal.Width[Sepal.Width<=4])])
This is related to the question Create a table with values from ecdf graph
The problem is that you are first subsetting the Sepal.Width. Consequently, the index returned by which.max applies to that sub-vector, and no longer corresponds to the indices of the whole Petal.Length vector.
To fix this, you also need to subset Petal.Length correspondingly, e.g.
…
`<=2.5` = Petal.Length[Sepal.Width <= 2.5][which.max(Sepal.Width[Sepal.Width <= 2.5])],
…
… of course this gets rather verbose and repetitive. It might be better to perform the subsetting in a separate step. However, this means creating new columns for every threshold value.
Incidentally, this is unrelated to dplyr.
To make it more scalable, using double loop:
myCuts <- c(2.5, 3, 3.5, 4)
res <- sapply(split(iris, iris$Species), function(i)
sapply(myCuts, function(j){
x <- i[ i$Sepal.Width <= j, ]
x$Petal.Length[ which.max(x$Sepal.Width) ]
}))
rownames(res) <- paste0("<=", myCuts)
res
# setosa versicolor virginica
# <=2.5 1.3 3.9 4.5
# <=3 1.4 4.2 5.9
# <=3.5 1.4 4.5 5.6
# <=4 1.2 4.5 6.7
Here's another way to get the same data. Create a group variable according to Sepal.Width values. Then within each group, select the row with the top Sepal.Width value. It is in a different "shape", but you can always pivot_wider if you want all the values as columns instead of rows.
iris %>%
group_by(Species,
Sepal.Width_grp = case_when(Sepal.Width <= 2.5 ~ '<=2.5',
Sepal.Width <= 3 ~ '<=3',
Sepal.Width <= 3.5 ~ '<=3.5',
Sepal.Width <= 4 ~ '<=4',
TRUE ~ '> 4')) %>%
top_n(1, -Sepal.Width) %>%
select(Species, Sepal.Width_grp, Top.Sepal.Width = Sepal.Width, Petal.Width)
# # A tibble: 25 x 4
# # Groups: Species, Sepal.Width_grp [12]
# Species Sepal.Width_grp Top.Sepal.Width Petal.Width
# <fct> <chr> <dbl> <dbl>
# 1 setosa <=3.5 3.1 0.2
# 2 setosa <=4 3.6 0.2
# 3 setosa <=3 2.9 0.2
# 4 setosa <=3.5 3.1 0.1
# 5 setosa <=4 3.6 0.2
# 6 setosa <=3.5 3.1 0.2
# 7 setosa > 4 4.1 0.1
# 8 setosa <=3.5 3.1 0.2
# 9 setosa <=4 3.6 0.1
# 10 setosa <=2.5 2.3 0.3
# # ... with 15 more rows
Edit: A little simpler if you use cut
iris %>%
group_by(Species,
Sepal.Width_grp = cut(Sepal.Width, c(0, 2.5, 3, 3.5, 4, Inf))) %>%
top_n(1, -Sepal.Width) %>%
select(Species, Sepal.Width_grp, Top.Sepal.Width = Sepal.Width, Petal.Width)
# # A tibble: 25 x 4
# # Groups: Species, Sepal.Width_grp [12]
# Species Sepal.Width_grp Top.Sepal.Width Petal.Width
# <fct> <fct> <dbl> <dbl>
# 1 setosa (3,3.5] 3.1 0.2
# 2 setosa (3.5,4] 3.6 0.2
# 3 setosa (2.5,3] 2.9 0.2
# 4 setosa (3,3.5] 3.1 0.1
# 5 setosa (3.5,4] 3.6 0.2
# 6 setosa (3,3.5] 3.1 0.2
# 7 setosa (4,Inf] 4.1 0.1
# 8 setosa (3,3.5] 3.1 0.2
# 9 setosa (3.5,4] 3.6 0.1
# 10 setosa (0,2.5] 2.3 0.3
# # ... with 15 more rows

dplyr nested ifelse errors - is it vector recycling?

I can write this code that adds two columns to the iris data set. The first added column is a sum of the first four columns. The second added column is my attempt at "programming".
iris.size <- iris %>%
mutate(Total =
apply(.[(1:4)], 1, sum)
) %>%
mutate(Size =
ifelse(
apply(.[(1:4)], 1, sum) != 0 &
.[2] > .[3], "Output1",
ifelse(
apply(.[(1:4)], 1, sum) == 0 &
.[2] > .[3], "Output2",
"Output3")
)
)
You'll notice this code does not throw any errors and it does output what I want it to output. But watch what happens when I try my next step in analysis.
iris.size %>% arrange(Size)
Error: Column Size must be a 1d atomic vector or a list
It must be my ifelse logic. Correct? Ifelse logic seems straightforward. If condition 1 than output1, otherwise if condition 2 than output2, otherwise output3.
I ended up forcing iris.size$Size into a vector using as.vector but I'd like to know where my logic went wrong in the first place so I don't have to resort to using band aids in the future. After some googling it sounds like if statements are preferred over ifelse statements in R, but if statements only seem to work on single logical values, not vectors.
When you run your code, you get this output as iris.size:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Total Sepal.Width
1 5.1 3.5 1.4 0.2 setosa 10.2 Output1
2 4.9 3.0 1.4 0.2 setosa 9.5 Output1
3 4.7 3.2 1.3 0.2 setosa 9.4 Output1
4 4.6 3.1 1.5 0.2 setosa 9.4 Output1
5 5.0 3.6 1.4 0.2 setosa 10.2 Output1
6 5.4 3.9 1.7 0.4 setosa 11.4 Output1
The reason why it's not displaying Size is because the column Size has not been created. The reason that is occurring is because you're comparing two objects of class data.frame() with .[2] > .[3], not two vectors which would happen with .[, 2] > .[, 3].
I'm still trying to understand what is being created. What is that Sepal.Width column?
Adjust yours with the following:
iris.size <- iris %>% mutate(Total =
apply(.[(1:4)], 1, sum) ) %>% mutate(Size =
ifelse(
apply(.[(1:4)], 1, sum) != 0 &
.[,2] > .[,3], "Output1",
ifelse(
apply(.[(1:4)], 1, sum) == 0 &
.[,2] > .[,3], "Output2",
"Output3")
) )
iris.size
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Total Size
1 5.1 3.5 1.4 0.2 setosa 10.2 Output1
2 4.9 3.0 1.4 0.2 setosa 9.5 Output1
3 4.7 3.2 1.3 0.2 setosa 9.4 Output1
4 4.6 3.1 1.5 0.2 setosa 9.4 Output1
5 5.0 3.6 1.4 0.2 setosa 10.2 Output1
6 5.4 3.9 1.7 0.4 setosa 11.4 Output1
Suggestion:
Here's a condensed version of your code, if you're interested. You can replace Sepal.Width and Sepal.Length with .[,2] and .[,3] if need be.
iris.size <- iris %>%
mutate(Total = rowSums(.[,sapply(., is.numeric)]),
Size = ifelse(Total != 0 & Sepal.Width > Sepal.Length, "Output1",
ifelse(Total == 0 & Sepal.Width > Sepal.Length, "Output2", "Output3")))%>%
arrange(Size)
iris.size
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Total Size
1 5.1 3.5 1.4 0.2 setosa 10.2 Output1
2 4.9 3.0 1.4 0.2 setosa 9.5 Output1
3 4.7 3.2 1.3 0.2 setosa 9.4 Output1
4 4.6 3.1 1.5 0.2 setosa 9.4 Output1
5 5.0 3.6 1.4 0.2 setosa 10.2 Output1
6 5.4 3.9 1.7 0.4 setosa 11.4 Output1
Making use of rowwise and splitting things up a bit for readability...
iris.size <- iris %>%
mutate(Total =
apply(.[(1:4)], 1, sum)
)
iris.size <-iris.size %>% rowwise %>% mutate(Size =
if(
Total != 0 && Sepal.Width > Petal.Length) {
"Output1"
} else {
if(Total == 0 && Petal.Length > Petal.Length){
"Output2"
} else {
"Output3"}}
)
class(iris.size$Size)
[1] "character"
> iris.size %>% arrange(Size)
# A tibble: 150 x 7
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
7 4.6 3.4 1.4 0.3
8 5.0 3.4 1.5 0.2
9 4.4 2.9 1.4 0.2
10 4.9 3.1 1.5 0.1
# ... with 140 more rows, and 3 more variables:
# Species <fctr>, Total <dbl>, Size <chr>
>
The error message is caused by the fact that iris.size["Size"] is an object of type data.frame(). This can be confirmed by the str() function:
> str(iris.size["Size"])
'data.frame': 150 obs. of 1 variable:
$ Size: chr [1:150, 1] "Output1" "Output1" "Output1" "Output1" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "Sepal.Width"
>
Casting the object with as.vector() resolves the problem because the data frame contains 1 column.

Mutate new variable based on group-level statistic

I want to append the group-maximum to table of observations, e.g:
iris %>% split(iris$Species) %>%
lapply(function(l) mutate(l, species_max = max(Sepal.Width))) %>%
bind_rows() %>% .[c(1,51,101),]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species species_max
1 5.1 3.5 1.4 0.2 setosa 4.4
51 7.0 3.2 4.7 1.4 versicolor 3.4
101 6.3 3.3 6.0 2.5 virginica 3.8
Is there a more elegant dplyr::group_by solution to achieve this?
How about this:
group_by(iris, Species) %>%
mutate(species_max = max(Sepal.Width)) %>%
slice(1)
# Source: local data frame [3 x 6]
# Groups: Species [3]
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species species_max
# <dbl> <dbl> <dbl> <dbl> <fctr> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 4.4
# 2 7.0 3.2 4.7 1.4 versicolor 3.4
# 3 6.3 3.3 6.0 2.5 virginica 3.8
The difficulty here is that you need to summarise multiple columns (for which summarise_all would be great) but at the same time you need to add a new column (for which you either need a simple summarise or mutate call).
In this regard data.table allows greater flexibility since it only relies on a list in its j-argument. So you can do it as follows with data.table, just as a comparison:
library(data.table)
dt <- as.data.table(iris)
dt[, c(lapply(.SD, first), species_max = max(Sepal.Width)), by = Species]

Resources