R dataframe - Top n values in row with column names - r

I want to sort rowwise values in specific columns, get top 'n' values, and get corresponding column names in new columns.
The output would look something like this:
SL SW PL PW Species high1 high2 high3 col1 col2 col3
dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 3.5 1.4 0.2 SW PL PW
2 4.9 3 1.4 0.2 setosa 3 1.4 0.2 SW PL PW
3 4.7 3.2 1.3 0.2 setosa 3.2 1.3 0.2 SW PL PW
Tried something like code below, but unable to get column names.
What I'm hoping to achieve is to compare the highest 'n' values (rows[n]) with values in dataframe for each row, and then extract corresponding column name of matching value. For eg. rows[1] == 3.5 (from column 'SW'). Is this feasible?
Help appreciated.
iris %>%
rowwise() %>%
mutate(rows = list(sort(c( Sepal.Width, Petal.Length, Petal.Width), decreasing = TRUE))) %>%
mutate(high1 = rows[1], col1 = names(~.)[which(~.[] ==rows[1]),
high2 = rows[2], col2 = names(~.)[which(~.[] ==rows[2]),
high3 = rows[3], col3 = names(~.)[which(~.[] ==rows[3])
) %>%
select(-rows)

You could pivot to long, group by the corresponding original row, use slice_max to get the top values, then pivot back to wide and bind that output to the original table.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
iris %>%
group_by(rn = row_number()) %>%
pivot_longer(-c(Species, rn), 'col', values_to = 'high') %>%
slice_max(col, n = 2) %>%
mutate(nm = row_number()) %>%
pivot_wider(values_from = c(high, col),
names_from = nm) %>%
ungroup() %>%
select(-c(Species, rn)) %>%
bind_cols(iris)
#> # A tibble: 150 × 9
#> high_1 high_2 col_1 col_2 Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 Sepal.… Sepa… 5.1 3.5 1.4 0.2
#> 2 4.9 3 Sepal.… Sepa… 4.9 3 1.4 0.2
#> 3 4.7 3.2 Sepal.… Sepa… 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 Sepal.… Sepa… 4.6 3.1 1.5 0.2
#> 5 5 3.6 Sepal.… Sepa… 5 3.6 1.4 0.2
#> 6 5.4 3.9 Sepal.… Sepa… 5.4 3.9 1.7 0.4
#> 7 4.6 3.4 Sepal.… Sepa… 4.6 3.4 1.4 0.3
#> 8 5 3.4 Sepal.… Sepa… 5 3.4 1.5 0.2
#> 9 4.4 2.9 Sepal.… Sepa… 4.4 2.9 1.4 0.2
#> 10 4.9 3.1 Sepal.… Sepa… 4.9 3.1 1.5 0.1
#> # … with 140 more rows, and 1 more variable: Species <fct>
Created on 2022-02-16 by the reprex package (v2.0.1)
Edited to remove the unnecessary rename and mutate, thanks to tip from #Onyambu!

My approach is to make a function that takes any dataframe (df), any set of columns that you want to focus on (cols), and any value for top n (n)
# load data.table and magrittr (I only use %>% for illustration here)
library(data.table)
library(magrittr)
# define function
get_high_vals_cols <- function(df, cols, n=3) {
setDT(df)[, `_rowid`:=.I]
df_l <- melt(df,id = "_rowid",measure.vars = cols, variable.name = "col",value.name = "high") %>%
.[order(-high), .SD[1:n], by="_rowid"] %>%
.[,id:=1:.N, by="_rowid"]
dcast(df_l, `_rowid`~id, value.var = list("col", "high"))[,`_rowid`:=NULL]
}
Then, you can feed any dataframe to this function, along with any columns of interest
cols= c("Sepal.Width", "Petal.Length", "Petal.Width")
get_high_vals_cols(iris,cols,3)
Output
col_1 col_2 col_3 high_1 high_2 high_3
1: Sepal.Width Petal.Length Petal.Width 3.5 1.4 0.2
2: Sepal.Width Petal.Length Petal.Width 3.0 1.4 0.2
3: Sepal.Width Petal.Length Petal.Width 3.2 1.3 0.2
4: Sepal.Width Petal.Length Petal.Width 3.1 1.5 0.2
5: Sepal.Width Petal.Length Petal.Width 3.6 1.4 0.2
---
146: Petal.Length Sepal.Width Petal.Width 5.2 3.0 2.3
147: Petal.Length Sepal.Width Petal.Width 5.0 2.5 1.9
148: Petal.Length Sepal.Width Petal.Width 5.2 3.0 2.0
149: Petal.Length Sepal.Width Petal.Width 5.4 3.4 2.3
150: Petal.Length Sepal.Width Petal.Width 5.1 3.0 1.8

Related

How to add new row with concatenated strings for each group? [duplicate]

If I add a new row to the iris dataset with:
iris <- as_tibble(iris)
> iris %>%
add_row(.before=0)
# A tibble: 151 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 NA NA NA NA <NA> <--- Good!
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3.0 1.4 0.2 setosa
It works. So, why can't I add a new row on top of each "subset" with:
iris %>%
group_by(Species) %>%
add_row(.before=0)
Error: is.data.frame(df) is not TRUE
If you want to use a grouped operation, you need do like JasonWang described in his comment, as other functions like mutate or summarise expect a result with the same number of rows as the grouped data frame (in your case, 50) or with one row (e.g. when summarising).
As you probably know, in general do can be slow and should be a last resort if you cannot achieve your result in another way. Your task is quite simple because it only involves adding extra rows in your data frame, which can be done by simple indexing, e.g. look at the output of iris[NA, ].
What you want is essentially to create a vector
indices <- c(NA, 1:50, NA, 51:100, NA, 101:150)
(since the first group is in rows 1 to 50, the second one in 51 to 100 and the third one in 101 to 150).
The result is then iris[indices, ].
A more general way of building this vector uses group_indices.
indices <- seq(nrow(iris)) %>%
split(group_indices(iris, Species)) %>%
map(~c(NA, .x)) %>%
unlist
(map comes from purrr which I assume you have loaded as you have tagged this with tidyverse).
A more recent version would be using group_modify() instead of do().
iris %>%
as_tibble() %>%
group_by(Species) %>%
group_modify(~ add_row(.x,.before=0))
#> # A tibble: 153 x 5
#> # Groups: Species [3]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa NA NA NA NA
#> 2 setosa 5.1 3.5 1.4 0.2
#> 3 setosa 4.9 3 1.4 0.2
With a slight variation, this could also be done:
library(purrr)
library(tibble)
iris %>%
group_split(Species) %>%
map_dfr(~ .x %>%
add_row(.before = 1))
# A tibble: 153 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 NA NA NA NA NA
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3 1.4 0.2 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
6 5 3.6 1.4 0.2 setosa
7 5.4 3.9 1.7 0.4 setosa
8 4.6 3.4 1.4 0.3 setosa
9 5 3.4 1.5 0.2 setosa
10 4.4 2.9 1.4 0.2 setosa
# ... with 143 more rows
This also can be used for grouped data frame, however, it's a bit verbose:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = c(NA, Sepal.Length),
Sepal.Width = c(NA, Sepal.Width),
Petal.Length = c(NA, Petal.Length),
Petal.Width = c(NA, Petal.Width),
Species = c(NA, Species))

Using a custom function with tidyverse

I created a dummy function to get the lag of one variable and I want to use it with other tidyverse functions. It works after I call mutate but not after calling group_by. It throws the following error:
Error in mutate_impl(.data, dots) :
Not compatible with STRSXP: [type=NULL].
Here is a repex:
#create a function to lag a selected variable
lag_func <- function(df, x) {
mutate(df, lag = lag(df[,x]))
}
#works
iris %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func('Petal.Length')
#doesn't work
iris %>%
group_by(Species) %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func('Petal.Length')
Any idea what the error means and/or how to fix it?
The best way to pass a column name as an argument to a tidyverse function is convert it to quosure using enquo(). See this code:
lag_func <- function(df, x) {
x <- enquo(x)
mutate(df, lag = lag(!!x)) # !! is to evaluate rather than quoting (x)
}
Now let's try our function:
iris %>%
group_by(Species) %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func(Petal.Length)
# A tibble: 150 x 7
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species lead lag
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.4 NA
2 4.9 3 1.4 0.2 setosa 1.3 1.4
3 4.7 3.2 1.3 0.2 setosa 1.5 1.4
4 4.6 3.1 1.5 0.2 setosa 1.4 1.3
5 5 3.6 1.4 0.2 setosa 1.7 1.5
6 5.4 3.9 1.7 0.4 setosa 1.4 1.4
7 4.6 3.4 1.4 0.3 setosa 1.5 1.7
8 5 3.4 1.5 0.2 setosa 1.4 1.4
9 4.4 2.9 1.4 0.2 setosa 1.5 1.5
10 4.9 3.1 1.5 0.1 setosa 1.5 1.4
# ... with 140 more rows
For more info on how to use tidyverse functions within your custom functions see here

Conditional non-equi join

library(tidyverse)
iris <- iris
means <- iris %>%
group_by(Species) %>%
summarise_all(funs(mean))
sd <- iris %>%
group_by(Species) %>%
summarise_all(funs(sd))
bottom <- means[ ,2:5] - sd[ ,2:5]
bottom$Species <- c("setosa", "versicolor", "virginica")
print(bottom)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.653510 3.048936 1.288336 0.1406144 setosa
2 5.419829 2.456202 3.790089 1.1282473 versicolor
3 5.952120 2.651503 5.000105 1.7513499 virginica
top <- means[ ,2:5] + sd[ ,2:5]
top$Species <- c("setosa", "versicolor", "virginica")
print(top)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.358490 3.807064 1.635664 0.3513856 setosa
2 6.452171 3.083798 4.729911 1.5237527 versicolor
3 7.223880 3.296497 6.103895 2.3006501 virginica
How do I get the rows of Iris where the values for Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width all fall between the values in the top and bottom data frames?
For example, I only want setosa rows where Sepal.Length > 4.65 & Sepal.Length < 5.35 and Sepal.Width is between 3.04 and 3.80, etc. Ideally the end result contains only the 4 numeric columns and the species column.
Thanks.
It would be much easier if you can filter from the beginning without the summarize step:
iris %>%
group_by(Species) %>%
filter_if(is.numeric, all_vars(. < mean(.) + sd(.) & . > mean(.) - sd(.)))
# A tibble: 54 x 5
# Groups: Species [3]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
# 3 5.0 3.6 1.4 0.2 setosa
# 4 5.0 3.4 1.5 0.2 setosa
# 5 4.8 3.4 1.6 0.2 setosa
# 6 5.1 3.5 1.4 0.3 setosa
# 7 5.1 3.8 1.5 0.3 setosa
# 8 5.2 3.5 1.5 0.2 setosa
# 9 5.2 3.4 1.4 0.2 setosa
#10 4.7 3.2 1.6 0.2 setosa
# ... with 44 more rows
Not sure if you can avoid the summarize step, post as an option here.
Or use between:
iris %>%
group_by(Species) %>%
filter_if(is.numeric, all_vars(between(., mean(.) - sd(.), mean(.) + sd(.))))
Here is a solution using non-equi joins which is building on the (now deleted) approach of #Frank:
library(data.table)
# add a row number column and to reshape from wide to long
DT <- melt(data.table(iris)[, rn := .I], id = c("rn", "Species"))
# compute lower and upper bound for each variable and Species
mDT <- DT[, .(lb = lb <- mean(value) - (s <- sd(value)),
ub = lb + 2 * s), by = .(Species, variable)]
# find row numbers of items which fulfill conditions
selected_rn <-
# non-equi join
DT[DT[mDT, on = .(Species, variable, value > lb, value < ub), which = TRUE]][
# all uniqueN(mDT$variable) variables must have been selected
# for an item to pass (thanks to #Frank for tip to avoid hardcoded value)
, .N, by = rn][N == uniqueN(mDT$variable), rn]
head(iris[sort(selected_rn),])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
8 5.0 3.4 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
18 5.1 3.5 1.4 0.3 setosa

tidyr::spread() function throws an error

I try to use gather and spread functions in tidyverse package, but it throws an error in spread function
library(caret)
dataset<-iris
# gather function is to convert wide data to long data
dataset_gather<-dataset %>% tidyr::gather(key=Type,value = Values,1:4)
head(dataset_gather)
# spead is the opposite of gather
This code below throws an error like this Error: Duplicate identifiers for rows
dataset_spead<- dataset_gather%>%tidyr::spread(key = Type,value = Values)
Added later: Sorry #alistaire, only saw your comment on the original post after posting this response.
As far as I understand Error: Duplicate identifiers for rows..., it occurs when you have values with the same identifier. For example in the original 'iris' dataset, the first five rows of Species = setosa all have a Petal.Width of 0.2, and three rows of Petal.Length have values of 1.4. Gathering those data isn't an issue, but when you try spread them, the function doesn't know what belongs to what. That is, which 0.2 Petal.Width and 1.4 Petal.Length belongs to which row of setosa.
The (tidyverse) solution I use in those circumstances is to create a unique marker for each row of data at the gather stage so that the function can keep track which duplicate data belong to which rows when you want to spread again. See example below:
# Load packages
library(dplyr)
library(tidyr)
# Get data
dataset <- iris
# View dataset
head(dataset)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# Gather data
dataset_gathered <- dataset %>%
# Create a unique identifier for each row
mutate(marker = row_number(Species)) %>%
# Gather the data
gather(key = Type, value = Values, 1:4)
# View gathered data
head(dataset_gathered)
#> Species marker Type Values
#> 1 setosa 1 Sepal.Length 5.1
#> 2 setosa 2 Sepal.Length 4.9
#> 3 setosa 3 Sepal.Length 4.7
#> 4 setosa 4 Sepal.Length 4.6
#> 5 setosa 5 Sepal.Length 5.0
#> 6 setosa 6 Sepal.Length 5.4
# Spread it out again
dataset_spread <- dataset_gathered %>%
# Group the data by the marker
group_by(marker) %>%
# Spread it out again
spread(key = Type, value = Values) %>%
# Not essential, but remove marker
ungroup() %>%
select(-marker)
# View spread data
head(dataset_spread)
#> # A tibble: 6 x 5
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#> <fctr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 1.4 0.2 5.1 3.5
#> 2 setosa 1.4 0.2 4.9 3.0
#> 3 setosa 1.3 0.2 4.7 3.2
#> 4 setosa 1.5 0.2 4.6 3.1
#> 5 setosa 1.4 0.2 5.0 3.6
#> 6 setosa 1.7 0.4 5.4 3.9
(and as ever, thanks to Jenny Bryan for the reprex package)
We can do this with data.table
library(data.table)
dcast(melt(setDT(dataset, keep.rownames = TRUE), id.var = c("rn", "Species")), rn + Species ~ variable)

Add row in each group using dplyr and add_row()

If I add a new row to the iris dataset with:
iris <- as_tibble(iris)
> iris %>%
add_row(.before=0)
# A tibble: 151 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 NA NA NA NA <NA> <--- Good!
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3.0 1.4 0.2 setosa
It works. So, why can't I add a new row on top of each "subset" with:
iris %>%
group_by(Species) %>%
add_row(.before=0)
Error: is.data.frame(df) is not TRUE
If you want to use a grouped operation, you need do like JasonWang described in his comment, as other functions like mutate or summarise expect a result with the same number of rows as the grouped data frame (in your case, 50) or with one row (e.g. when summarising).
As you probably know, in general do can be slow and should be a last resort if you cannot achieve your result in another way. Your task is quite simple because it only involves adding extra rows in your data frame, which can be done by simple indexing, e.g. look at the output of iris[NA, ].
What you want is essentially to create a vector
indices <- c(NA, 1:50, NA, 51:100, NA, 101:150)
(since the first group is in rows 1 to 50, the second one in 51 to 100 and the third one in 101 to 150).
The result is then iris[indices, ].
A more general way of building this vector uses group_indices.
indices <- seq(nrow(iris)) %>%
split(group_indices(iris, Species)) %>%
map(~c(NA, .x)) %>%
unlist
(map comes from purrr which I assume you have loaded as you have tagged this with tidyverse).
A more recent version would be using group_modify() instead of do().
iris %>%
as_tibble() %>%
group_by(Species) %>%
group_modify(~ add_row(.x,.before=0))
#> # A tibble: 153 x 5
#> # Groups: Species [3]
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa NA NA NA NA
#> 2 setosa 5.1 3.5 1.4 0.2
#> 3 setosa 4.9 3 1.4 0.2
With a slight variation, this could also be done:
library(purrr)
library(tibble)
iris %>%
group_split(Species) %>%
map_dfr(~ .x %>%
add_row(.before = 1))
# A tibble: 153 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 NA NA NA NA NA
2 5.1 3.5 1.4 0.2 setosa
3 4.9 3 1.4 0.2 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.6 3.1 1.5 0.2 setosa
6 5 3.6 1.4 0.2 setosa
7 5.4 3.9 1.7 0.4 setosa
8 4.6 3.4 1.4 0.3 setosa
9 5 3.4 1.5 0.2 setosa
10 4.4 2.9 1.4 0.2 setosa
# ... with 143 more rows
This also can be used for grouped data frame, however, it's a bit verbose:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise(Sepal.Length = c(NA, Sepal.Length),
Sepal.Width = c(NA, Sepal.Width),
Petal.Length = c(NA, Petal.Length),
Petal.Width = c(NA, Petal.Width),
Species = c(NA, Species))

Resources