I am working with the dplyr library and have created a dataframe in a pipe that looks something like this:
a <- c(1, 2, 2)
b <- c(3, 4, 4)
data <- data.frame(a, b)
data %>% summarize_all(c(min, max))
which gives me this dataframe:
a_fn1 b_fn1 a_fn2 b_fn2
1 3 2 4
and I am trying to reshape this dataframe so that the output of the pipe stacks multiple columns on top of each other in several rows that look like this:
A B
----
1 3
2 4
How would I go about this? I do not want to change how the functions are called because the summarize_all function helps me achieve the values I am looking for. I just want to know how to change this dataframe to the shape such that each value in each row is the value of the summarize function for the given column.
First, naming your functions in summarize_all() makes those names appear in the result's column names (a_min instead of a_fn1), which makes the wrangling easier.
Then, you can use pivot_longer() with the special .value sentinel in names_to to achieve what you want:
library(tidyverse)
a <- c(1, 2, 2)
b <- c(3, 4, 4)
data <- data.frame(a, b)
data %>%
summarize_all(c(min=min, max=max)) %>%
pivot_longer(everything(), names_to=c(".value", "variable"), names_pattern="(.+)_(.+)")
#> # A tibble: 2 x 3
#> variable a b
#> <chr> <dbl> <dbl>
#> 1 min 1 3
#> 2 max 2 4
Created on 2021-07-22 by the reprex package (v2.0.0)
Depending on what output you want, you can even switch the order to c("variable", ".value").
Note that summarize_all() is superseded and that you might want to use the new, more verbose syntax: summarize(across(everything(), c(min=min, max=max))).
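As a rough sketch of the across() form, assuming dplyr 1.0.0 or later and tidyr 1.0.0 or later: across() names its output columns {column}_{function} by default, so names_sep = "_" should be enough to split them here. This assumes the original column names contain no underscore; the name result is only for illustration.

```r
library(dplyr)
library(tidyr)

data <- data.frame(a = c(1, 2, 2), b = c(3, 4, 4))

result <- data %>%
  # A named list gives columns a_min, a_max, b_min, b_max
  summarise(across(everything(), list(min = min, max = max))) %>%
  # ".value" sends the part before "_" to the output column names,
  # while the part after "_" becomes the "variable" column
  pivot_longer(
    everything(),
    names_to = c(".value", "variable"),
    names_sep = "_"
  )

result
```

If the column names themselves contain underscores, a names_pattern as in the answer above is the safer choice.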
I have this input:
t <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
And want to have the rowwise nth-lowest element of the dataframe ordered by the rowwise values, so that the output is something like this (example for nth_element = 2):
[1] 2 3 5 4
I tried a function like this:
apply(t, 1, nth, n=1, order_by = .)
But this does not work. Two questions:
What should I pass as the order_by argument to make this function work?
What is the best way to summarise rows with a custom summary function if I don't want to mention the column names in the rowwise summary function?
Sidenote:
I don't want to mention the column names specifically; I want the function to use all columns in the dataset.
I tried the rownth function from the Rfast package, but it only returns one result. Does anybody know what I am doing wrong?
We can use apply and sort to do this.
d <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
nth_lowest <- 2
apply(d, 1, FUN = function(x) sort(x)[nth_lowest])
# [1] 2 3 5 4
Note that I am calling the data d instead of t. t is not a reserved word, but it is already the name of a base R function (matrix transpose), so reusing it for data is confusing.
Not as elegant as @bouncyball's answer, but using dplyr (and tidyr), one possibility is to do:
library(dplyr)
library(tidyr)
t %>% mutate(Row = row_number()) %>%
pivot_longer(-Row, names_to = "Col", values_to = "Val") %>%
group_by(Row) %>%
arrange(Val) %>%
slice(2) %>%
select(Val)
Adding missing grouping variables: `Row`
# A tibble: 4 x 2
# Groups: Row [4]
Row Val
<int> <dbl>
1 1 2
2 2 3
3 3 5
4 4 4
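For the second part of the question (a row-wise summary that never names the columns), a sketch with rowwise() and c_across() might look like the following. It assumes dplyr 1.0.0 or later; nth_val and res are illustrative names.

```r
library(dplyr)

d <- data.frame(x = c(1, 2, 8, 4), y = c(2, 3, 4, 5), k = c(3, 4, 5, 1))
nth_lowest <- 2

res <- d %>%
  rowwise() %>%
  # c_across(everything()) collects all columns of the current row into one vector
  mutate(nth_val = sort(c_across(everything()))[nth_lowest]) %>%
  ungroup()

res$nth_val
# [1] 2 3 5 4
```

Any function of a vector can replace the sort(...)[nth_lowest] expression, so the same pattern covers other custom row summaries.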
Using Rfast, you could reduce the run time for big inputs, but it works on matrices only, so the data frame has to be converted first.
d <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
d <- Rfast::data.frame.to_matrix(d)
# rownth() expects one element index per row, hence nrow(d) rather than ncol(d)
nth_lowests <- rep(2, nrow(d))
Rfast::rownth(d,nth_lowests)
# [1] 2 3 5 4
You could also use the parallel version of Rfast::rownth.
I'm trying to use dplyr to group by a variable and identify the nearest location for every place in my dataset. I would also like to include all rows for which distance has not been measured (NA).
# Set up df of place, distance, and destination.
df <- data.frame(place = c('A','B','B','C','C','D','D'),dist = c(NA, 4, 1, 6, 3, 1, 1), dest = 1:7)
# For each place, get the nearest destination.
df %>%
group_by(place) %>%
top_n(1, desc(dist))
# This does not return a row for place A.
Is there a dplyr solution for using top_n to identify rows based on rank that will also include rows that have not been ranked? Thank you in advance.
This works but there are probably more efficient solutions.
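The coalesce(dist, max(dist) + 1, 0) call is there because we prioritize non-missing values. A missing dist is first replaced by max(dist) + 1 for the group, so that it cannot outrank a measured distance in top_n. If that is still NA, as it is when every dist in the group is missing, the final 0 makes sure a value is actually returned; any number would do. Note that max(dist) is NA for any group containing a missing value, so a group that mixes missing and measured distances would need max(dist, na.rm = TRUE) for the first fallback to take effect.
If you were ranking without desc(), you would likely use min(dist) instead of max(dist).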
df %>%
group_by(place) %>%
top_n(1, desc(coalesce(dist, max(dist)+1, 0)))
place dist dest
<fct> <dbl> <int>
1 A NA 1
2 B 1 3
3 C 3 5
4 D 1 6
5 D 1 7
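An alternative sketch that avoids the placeholder values: keep the rows ranked first within each place, and also keep a place's rows when none of its distances were measured. This is one possible approach rather than the answer's method; it relies on min_rank() returning NA for missing values, and nearest is an illustrative name.

```r
library(dplyr)

df <- data.frame(
  place = c("A", "B", "B", "C", "C", "D", "D"),
  dist = c(NA, 4, 1, 6, 3, 1, 1),
  dest = 1:7
)

nearest <- df %>%
  group_by(place) %>%
  # min_rank(dist) == 1 keeps the smallest distance (ties included);
  # all(is.na(dist)) keeps the rows of places with no measured distance
  filter(min_rank(dist) == 1 | all(is.na(dist))) %>%
  ungroup()

nearest
```

In a group that mixes missing and measured distances, the missing rows evaluate to NA in the filter condition and are dropped, so measured values take priority.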
My data frame looks like this:
df <- tibble(x = c(1, 2, NA),
y = c(1, NA, 3),
z = c(NA, 2, 3))
I want to replace NA with 0 using tidyr::replace_na(). As this function's documentation makes clear, it's straightforward to do this once you know which columns you want to perform the operation on.
df <- df %>% replace_na(list(x = 0, y = 0, z = 0))
But what if you have an indeterminate number of columns? (I say 'indeterminate' because I'm trying to create a function that does this on the fly using dplyr tools.) If I'm not mistaken, the base R equivalent to what I'm trying to achieve using the aforementioned tools is:
df[, 1:ncol(df)][is.na(df[, 1:ncol(df)])] <- 0
But I always struggle to get my head around this code. Thanks in advance for your help.
We can do this by creating a list of 0's with one element per column of the dataset and setting its names to the column names:
library(tidyverse)
df %>%
replace_na(set_names(as.list(rep(0, length(.))), names(.)))
# A tibble: 3 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 1 0
#2 2 0 2
#3 0 3 3
Another option is mutate_all() (or mutate_at() for selected columns, or mutate_if() based on conditions), applying replace_na:
df %>%
mutate_all(replace_na, replace = 0)
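The scoped verbs such as mutate_all() are superseded in dplyr 1.0.0 and later, so the same idea can be sketched with across(). The name filled is illustrative, and the example assumes every column is numeric, as in the question.

```r
library(dplyr)
library(tidyr)

df <- tibble(x = c(1, 2, NA),
             y = c(1, NA, 3),
             z = c(NA, 2, 3))

filled <- df %>%
  # Apply replace_na() to every column, whatever the columns are called
  mutate(across(everything(), ~ replace_na(.x, 0)))

filled
```

For data with mixed column types, swapping everything() for where(is.numeric) restricts the replacement to numeric columns.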
With base R, it is more straightforward:
df[is.na(df)] <- 0
Suppose I have a data frame like the following
df=data.frame(x=1:5,y=c("a","b","c","d","e"))
where y is the key column. Sometimes I want to look up values of x corresponding to a series of keys in y. To accomplish this, I can
row.names(df)=df$y
df[c("b","d","c"),c("x")]
and I will get
[1] 2 4 3
Note the order of values returned is the same as that of the series of given keys.
Now I want to achieve the same thing using tidyverse's tibble. But since tibble does not have row.names, I have no idea how to do it.
My question is, what is the "most clever" way (or idiomatic way, to borrow a term from Python) to look up values in a tibble given a series of keys, following the order of the keys?
The non-rownames way of doing this with a data.frame is
df[match(c('b', 'd', 'c'), df$y), 'x']
This works with tibbles as well, although [ on a tibble returns a one-column tibble rather than a vector. Alternatively, use dplyr verbs:
df %>% slice(match(c('b', 'd', 'c'), y)) %>% pull(x)
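I would use filter, although note that it returns the rows in the order they appear in the data, not in the order of the keys: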
library(tidyverse)
df <- tibble(
x = 1:5,
y = c("a","b","c","d","e")
)
df %>%
filter(y %in% c("b","d","c"))
#> # A tibble: 3 x 2
#> x y
#> <int> <chr>
#> 1 2 b
#> 2 3 c
#> 3 4 d
Created on 2018-07-12 by the reprex package (v0.2.0.9000).
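If the order of the keys matters, a join is another option. The sketch below puts the keys in their own tibble and joins the data onto it; left_join() keeps the row order of its first argument. The names keys and looked_up are illustrative.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "c", "d", "e")
)

keys <- c("b", "d", "c")

looked_up <- tibble(y = keys) %>%
  # Rows come back in the order of keys, not in the order of df
  left_join(df, by = "y") %>%
  pull(x)

looked_up
# [1] 2 4 3
```

A key that is missing from df yields NA here instead of being dropped silently.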
I would like to implement a function that has the same interface as dplyr's filter() but, instead of removing the rows that do not match a condition, returns a vector with an indicator variable or attaches such a column to the returned tibble. How can I do this?
I would find it very useful since it would allow me to compute summaries of some columns after and before filtering as well as summaries of the rows which would have been removed on a single tibble.
I find the dplyr::filter interface very convenient and therefore would like to emulate it.
I think group_by will help you here.
You might normally filter and then summarise like so:
library(dplyr)
mtcars %>%
filter(cyl==4) %>%
summarise(mean=mean(gear))
# mean
# 1 4.090909
Instead, you can group_by, summarise, and then filter:
mtcars %>%
group_by(cyl) %>%
summarise(mean=mean(gear))
# optional filter here
# # A tibble: 3 x 2
# cyl mean
# <dbl> <dbl>
# 1 4 4.090909
# 2 6 3.857143
# 3 8 3.285714
You can group by conditionals as well, like so:
mtcars %>%
group_by(cyl > 4) %>%
summarise(mean=mean(gear))
# # A tibble: 2 x 2
# `cyl > 4` mean
# <lgl> <dbl>
# 1 FALSE 4.090909
# 2 TRUE 3.476190
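To get closer to the filter-like interface asked for, a small wrapper can attach the indicator column instead of dropping rows. This is only a sketch: flag_rows() and the column name matched are hypothetical, it handles a single condition, and it assumes rlang's {{ }} operator (dplyr 1.0.0 or later).

```r
library(dplyr)

# Hypothetical helper: same call shape as filter(), but keeps every row
flag_rows <- function(.data, cond) {
  # coalesce() turns NA results into FALSE, mirroring how filter() drops them
  mutate(.data, matched = coalesce({{ cond }}, FALSE))
}

flagged <- mtcars %>% flag_rows(cyl == 4)

# Summaries of the kept and the would-be-removed rows on a single tibble
flagged %>%
  group_by(matched) %>%
  summarise(mean = mean(gear))
```

Because the indicator is an ordinary column, filter(matched) recovers the usual filtered result afterwards.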
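You need enquo() and !! (or UQ() in older rlang versions). See the following example:
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise <- function(df, group_by) {
# enquo() captures the expression the caller passed; quo() would capture the literal symbol group_by
quo_group_by <- enquo(group_by)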
print(quo_group_by)
df %>%
group_by(!!quo_group_by) %>%
summarise(a = mean(a))
}
my_summarise(df, g1)
For more examples and discussion see http://dplyr.tidyverse.org/articles/programming.html
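With rlang 0.4.0 or later, the embrace operator {{ }} captures and unquotes the argument in one step. A minimal sketch follows; group_var and out are illustrative names, and the data uses fixed values instead of sample() so that the result is reproducible.

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = c(1, 2, 3, 4, 5)
)

my_summarise <- function(df, group_var) {
  df %>%
    # {{ group_var }} forwards the caller's bare column name to group_by()
    group_by({{ group_var }}) %>%
    summarise(a = mean(a))
}

out <- my_summarise(df, g1)
out
```

Calling my_summarise(df, g2) groups by the other column without any change to the function.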