I have a data frame with two sets of columns, v and t, among others.
library(tidyverse)
(df <-
tibble(id = 1,
v1 = 1, v2 = 2, v3 = 3,
t1 = "a", t2 = "b", t3 = "c"
)
)
#> # A tibble: 1 × 7
#> id v1 v2 v3 t1 t2 t3
#> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 1 1 2 3 a b c
I want my output to be three rows long. I think one way to do this is to nest the similar columns and then call unnest_longer(), but this is not allowed:
## unnest_longer can't handle multiple cols
df %>%
nest(v = c(v1, v2, v3),
t = c(t1, t2, t3)) %>%
unnest_longer(c("v", "t"))
#> Error: Must extract column with a single valid subscript.
#> x Subscript `var` has size 2 but must be size 1.
Is it possible to unnest_longer multiple columns at once?
According to the documentation (?unnest_longer), it takes only a single column:
col - List-column to extract components from.
whereas the corresponding argument in unnest() is cols, which can unnest more than one column.
Perhaps the OP wanted to use pivot_longer() instead of nest/unnest, i.e. reshape to 'long' format by specifying cols as everything except the 'id' column, sending the values to .value, and capturing the non-digit prefix (\\D+) of each column name with names_pattern:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -id, names_to = ".value",
names_pattern = "^(\\D+).*")
# A tibble: 3 × 3
# id v t
# <dbl> <dbl> <chr>
#1 1 1 a
#2 1 2 b
#3 1 3 c
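A small variation of the same idea (my addition, not part of the original answer): if you also want to record which original column each row came from, capture the trailing digits into a column of their own. The name "set" here is arbitrary.
df %>%
  pivot_longer(cols = -id,
               names_to = c(".value", "set"),
               names_pattern = "^(\\D+)(\\d+)$")
This should return the same three rows plus a character column set holding "1", "2", "3".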
tibble(
  A = c("A", "A", "B", "B"),
  x = c(NA, NA, NA, 1),
  y = c(1, 2, 3, 4)
) %>% group_by(A) -> df
desired output:
tibble(
  A = c("B", "B"),
  x = c(NA, 1),
  y = c(3, 4)
)
I want to find all groups for which the elements of x (and only x) are all NA, then remove those groups. Group "B" is kept because it has at least one non-NA element in x.
I tried:
df %>%
filter(all(!is.na(x)))
but that filters a group out if it contains even a single NA; I need a different condition, and all() on its own is not it.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both of its values in column x are NA.
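Equivalently (by De Morgan's law; this aside is mine, not part of the original answer), you can keep the groups where at least one value of x is not NA:
library(dplyr)
df %>%
  group_by(A) %>%
  filter(any(!is.na(x)))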
We can use any with complete.cases
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup
Output:
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
In dplyr (>= 1.1.0), we can use the .by argument in filter(), so we don't need group_by()/ungroup():
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
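For completeness, a base R sketch of the same filter (my addition, assuming df is the grouped tibble built above): compute a per-group any() with ave() and subset the rows.
# TRUE for rows whose group has at least one non-NA x
keep <- ave(!is.na(df$x), df$A, FUN = any)
df[keep, ]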
In the example below, I would like to add a column 'value' whose entries are taken from the column named in 'variable' (i.e., 1 and 20).
toy_data <-
tibble::tribble(
~x, ~y, ~variable,
1, 2, "x",
10, 20, "y"
)
Like this:
 x    y   variable   value
 1    2   x              1
10   20   y             20
However, none of the below works:
toy_data %>%
dplyr::mutate(
value = get(variable)
)
toy_data %>%
dplyr::mutate(
value = mget(variable)
)
toy_data %>%
dplyr::mutate(
value = mget(variable, inherits = TRUE)
)
toy_data %>%
dplyr::mutate(
value = !!variable
)
How can I do this?
If you know which variables you have in the dataframe in advance: use simple logic like ifelse() or dplyr::case_when() to choose between them.
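For instance, a minimal case_when() sketch for this toy data, assuming only "x" and "y" can appear in variable:
library(dplyr)
toy_data %>%
  mutate(value = case_when(variable == "x" ~ x,
                           variable == "y" ~ y))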
If not: use functional programming. Below is an example:
library(dplyr)
f <- function(data, variable_col) {
data[[variable_col]] %>%
purrr::imap_dbl(~ data[[.y, .x]])
}
toy_data$value <- f(toy_data, "variable")
Here are a few options that should scale well.
First is a base option that works along both the variable column and its index. (I made a copy of the data frame just so I had the original intact for more programming.)
library(dplyr)
toy2 <- toy_data
toy2$value <- mapply(function(v, i) toy_data[[v]][i], toy_data$variable, seq_along(toy_data$variable))
toy2
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Second uses purrr::imap_dbl to iterate along the variable and its index and return a double.
toy_data %>%
mutate(value = purrr::imap_dbl(variable, function(v, i) toy_data[[v]][i]))
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Third is the least straightforward, but it's what I'd most likely use personally, maybe just because it fits many of my workflows. Pivoting makes a long version of the data, letting you see both the values of variable and the corresponding values of x and y, which you can then filter for rows where those two columns match. Then self-join back to the original data frame.
inner_join(
toy_data,
toy_data %>%
tidyr::pivot_longer(cols = -variable, values_to = "value") %>%
filter(variable == name),
by = "variable"
) %>%
select(-name)
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Edit: @jpiversen rightly points out that the self-join won't work if variable has duplicates. In that case, add a row number to the data and use it as an additional joining column. Here I first add an extra observation to illustrate.
toy3 <- toy_data %>%
add_row(x = 5, y = 4, variable = "x") %>%
tibble::rowid_to_column()
inner_join(
toy3,
toy3 %>%
tidyr::pivot_longer(cols = c(-rowid, -variable), values_to = "value") %>%
filter(variable == name),
by = c("rowid", "variable")
) %>%
select(-name, -rowid)
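One more option, not in the answers above (a sketch): rowwise() turns every column in the data mask into a length-one slice for the current row, so get() applied to the current value of variable returns exactly the matching cell.
toy_data %>%
  rowwise() %>%
  mutate(value = get(variable)) %>%
  ungroup()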
I would like to use pivot_longer() from {tidyr} with names_pattern to convert my data to long format while keeping the prefix string from one of the pattern matches in the column names.
This seems counter-intuitive, but I want to convert to long format before applying data-dictionary cleaning steps, which require the original column names.
Set-up
library(dplyr)
library(tidyr)
d <- tibble(id = 1,
other_var = "foo",
suffix_t1_value1 = "a",
suffix_t1_value2 = "b",
suffix_t2_value1 = "c",
suffix_t2_value2 = "d")
What I've done
> pivot_longer(d,
starts_with("suffix"),
names_pattern = "suffix_t(1|2)_(.*)",
names_to = c("rep", ".value"))
# A tibble: 2 x 5
id other_var rep value1 value2
<dbl> <chr> <chr> <chr> <chr>
1 1 foo 1 a b
2 1 foo 2 c d
Desired output
# A tibble: 2 x 5
id other_var rep suffix_t1_value1 suffix_t1_value2
<dbl> <chr> <chr> <chr> <chr>
1 1 foo 1 a b
2 1 foo 2 c d
What I've tried
Attempt 1
> pivot_longer(d,
starts_with("suffix"),
names_pattern = "suffix_t(1|2)_(.*)",
names_to = c("rep", "suffix_t1_{.value}"))
Attempt 2
> pivot_longer(d,
starts_with("suffix"),
names_pattern = "suffix_t(1|2)_(.*)",
names_to = c("rep", paste0("suffix_t1_", ".value")))
I assume you want to do it in one step within pivot_longer. I haven't figured out yet if that's possible, but if a two-step process is OK, the approach below should work:
library(dplyr)
library(tidyr)
d %>% pivot_longer(starts_with("suffix"),
names_pattern = "suffix_t(1|2)_(.*)",
names_to = c("rep", ".value")
) %>%
rename_with(~ gsub("(.*)", "suffix_t1_\\1", .x),
starts_with("value"))
#> # A tibble: 2 x 5
#> id other_var rep suffix_t1_value1 suffix_t1_value2
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 foo 1 a b
#> 2 1 foo 2 c d
Created on 2021-06-09 by the reprex package (v0.3.0)
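A slightly more direct renaming function for the second step, equivalent to the gsub() call above (assuming the pivoted columns all start with "value"), is to prepend the prefix with paste0():
d %>%
  pivot_longer(starts_with("suffix"),
               names_pattern = "suffix_t(1|2)_(.*)",
               names_to = c("rep", ".value")) %>%
  rename_with(~ paste0("suffix_t1_", .x), starts_with("value"))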
Update
After digging into pivot_longer() a bit, I don't think it is possible to access .value inside paste(), and the glue syntax {.value} does not seem to be supported either.
However, {tidyr} exposes the building blocks for pivoting via build_longer_spec(), which lets us create our own my_pivot_longer() function with a names_fn argument that applies a function to the new column names; here we can use gsub() to add a prefix or suffix.
my_pivot_longer <- function(data,
cols,
names_to = "name",
names_pattern = NULL,
names_fn = NULL) {
spec <- build_longer_spec(data,
cols,
names_pattern = names_pattern,
names_to = names_to)
if (!is.null(names_fn)) {
fn <- rlang::as_function(names_fn)
spec$.value <- fn(spec$.value)
}
pivot_longer_spec(data, spec)
}
d %>%
my_pivot_longer(starts_with("suffix"),
names_pattern = "suffix_t(1|2)_(.*)",
names_to = c("rep", ".value"),
names_fn = ~ gsub("(.*)", "suffix_t1_\\1", .x))
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(cols)` instead of `cols` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> # A tibble: 2 x 5
#> id other_var rep suffix_t1_value1 suffix_t1_value2
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 foo 1 a b
#> 2 1 foo 2 c d
Created on 2021-06-09 by the reprex package (v0.3.0)
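If you don't need a reusable wrapper, the same spec manipulation can be done inline with the {tidyr} building blocks (a sketch; mutate() comes from the already-loaded dplyr):
spec <- build_longer_spec(d,
                          starts_with("suffix"),
                          names_pattern = "suffix_t(1|2)_(.*)",
                          names_to = c("rep", ".value")) %>%
  mutate(.value = paste0("suffix_t1_", .value))
pivot_longer_spec(d, spec)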
library(tidyverse)
df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9))
df %>% rowwise() %>% mutate(col4 = sd(c(col1, col3)))
# # A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
# 1 5 6 9 2.83
# 2 2 4 9 4.95
After asking a series of questions I can finally calculate standard deviation across rows. See my code above.
But I can't use column names in my production code, because the database I pull from likes to change the column names periodically. Luckily for me, the relative column positions are always the same.
So I'll just use column numbers instead. And let's check to make sure I can just swap things in and out:
identical(df$col1, df[[1]])
# [1] TRUE
Yes, I can just swap df[[1]] in place of df$col1. I think I do it like this.
df %>% rowwise() %>% mutate(col4 = sd(c(.[[1]], .[[3]])))
# # A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
# 1 5 6 9 3.40
# 2 2 4 9 3.40
df %>% rowwise() %>% {mutate(col4 = sd(c(.[[1]], .[[3]])))}
# Error in mutate_(.data, .dots = compat_as_lazy_dots(...)) :
# argument ".data" is missing, with no default
Nope, it looks like these don't work, because the results differ from my original. And I can't use apply; if you really need to know why, I asked about that in a separate question.
df %>% mutate(col4 = apply(.[, c(1, 3)], 1, sd))
How do I apply dplyr rowwise() with column numbers instead of names?
The issue with using .[[1]] or .[[3]] after rowwise() (grouping by row, so each group holds a single row) is that . refers to the whole data frame, which breaks out of the grouping structure and extracts the entire column rather than the current row's value. In order to avoid that, we can create a row_number() column before rowwise() and then subset the columns with that index:
library(dplyr)
df %>%
mutate(rn = row_number()) %>% # create a sequence of row index
rowwise %>%
mutate(col4 = sd(c(.[[1]][rn[1]], .[[3]][rn[1]]))) %>% #extract with index
select(-rn)
#Source: local data frame [2 x 4]
#Groups: <by row>
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 2.83
#2 2 4 9 4.95
Another option is map from purrr, where we loop over row_number() and subset the rows of the dataset ourselves:
library(purrr)
df %>%
mutate(col4 = map_dbl(row_number(), ~ sd(c(df[[1]][.x], df[[3]][.x]))))
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 2.83
#2 2 4 9 4.95
Another option is pmap, if we don't want to use row_number():
df %>%
mutate(col4 = pmap_dbl(.[c(1, 3)], ~ sd(c(...))))
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 2.83
#2 2 4 9 4.95
Of course, the easiest way would be to use rowSds from matrixStats, as described in the duplicate-tagged post.
NOTE: None of the above methods require any reshaping.
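With dplyr 1.0.0 or later, c_across() accepts column positions directly inside rowwise(), which may be the simplest modern equivalent (my addition, not part of the original answer):
df %>%
  rowwise() %>%
  mutate(col4 = sd(c_across(c(1, 3)))) %>%
  ungroup()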
Since you don't necessarily know the column names, but do know the positions of the columns for which you need the standard deviation, I'd reshape into long data and add an ID column. You can gather by position instead of by column name, either by giving the numbers of the columns that should become the key or the numbers of the columns to omit from the key. That way you don't need to refer to those values column by column, because they'll all be in one column already. Then join those summary values back to your original wide-shaped data.
library(dplyr)
library(tidyr)
df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9)) %>%
mutate(id = row_number())
df %>%
mutate(id = row_number()) %>%
gather(key, value, 1, 3) %>%
group_by(id) %>%
summarise(sd = sd(value)) %>%
inner_join(df, by = "id")
#> # A tibble: 2 x 5
#> id sd col1 col2 col3
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2.83 5 6 9
#> 2 2 4.95 2 4 9
Rearrange columns by position as you need.
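gather() is superseded these days; here is a sketch of the same join-back pattern with pivot_longer(), which also accepts column positions in cols (df here already carries the id column added when it was created above):
df %>%
  pivot_longer(cols = c(1, 3), names_to = "key", values_to = "value") %>%
  group_by(id) %>%
  summarise(sd = sd(value)) %>%
  inner_join(df, by = "id")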
An approach that transposes the data, converts it to a matrix, computes the standard deviation, transposes again, and converts back into a tibble:
df %>%
t %>%
rbind(col4 = c(sd(.[c(1, 3),1]), sd(.[c(1, 3),2]))) %>%
t %>%
as_tibble()
I would like to filter my tibble on several columns with similar names. Specifically, I'd like to compare x with x_new, y with y_new, and so on, without spelling the names out explicitly, instead exploiting the structure of the column names.
I tried to use filter_at, but it is not working, as I don't know how to properly evaluate the expression built in the last line.
my_df %>%
filter_at(vars(contains("_new")), any_vars(funs({
x <- .
x_name <- quo_name(quo(x))
x_new_name <- str_replace(x_name, "_new", "")
paste(x_name, "!=", x_new_name)
})
))
Data
my_df <- tibble(x = 1:5,
x_new = c(1:4, 1),
y = letters[1:5],
y_new = c(letters[1:3], "a", "e"))
# A tibble: 5 x 4
# x x_new y y_new
# <int> <dbl> <chr> <chr>
# 1 1 1. a a
# 2 2 2. b b
# 3 3 3. c c
# 4 4 4. d a
# 5 5 1. e e
Expected output
# A tibble: 2 x 4
# x x_new y y_new
# <int> <dbl> <chr> <chr>
# 1 4 4. d a
# 2 5 1. e e
We could do this with map. Create a vector of unique base names by removing the suffix part of the column names ('nm1'). Loop through 'nm1', select the columns that match each base name, reduce each pair to a single logical vector by checking whether the values are unequal, then reduce the list of logical vectors to one vector with |, and extract the rows based on that:
library(tidyverse)
nm1 <- unique(sub("_.*", "", names(my_df)))
map(nm1, ~ my_df %>%
select_at(vars(matches(.x))) %>%
reduce(`!=`)) %>%
reduce(`|`) %>%
magrittr::extract(my_df, ., )
# x x_new y y_new
# <int> <dbl> <chr> <chr>
#1 4 4 d a
#2 5 1 e e
Another option is to build an expression as a string and then parse and evaluate it:
library(rlang)
nm1 <- names(my_df) %>%
split(sub("_.*", "", .)) %>%
map(~ paste(.x, collapse=" != ") %>%
paste0("(", ., ")")) %>%
reduce(paste, sep = "|")
my_df %>%
filter(!! parse_expr(nm1))
# A tibble: 2 x 4
# x x_new y y_new
# <int> <dbl> <chr> <chr>
#1 4 4 d a
#2 5 1 e e
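A sketch with more current dplyr/purrr idioms (the helper objects base_nm and changed are my own names): build a row-wise "any pair differs" flag once, then filter on it.
library(dplyr)
library(purrr)
# base names that have a "_new" partner, e.g. "x", "y"
base_nm <- sub("_new$", "", grep("_new$", names(my_df), value = TRUE))
# TRUE wherever at least one base/"_new" pair differs in that row
changed <- map(base_nm, ~ my_df[[.x]] != my_df[[paste0(.x, "_new")]]) %>%
  reduce(`|`)
my_df %>%
  filter(changed)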