Use pivot_longer to cast data to long with repeated column names - r

I have a df whose size is not fixed. The example below has only 2 traits, "density" and "lipids", but other dfs may have 50 or more traits. Each trait has 3 columns associated with it: value.trait, unit.trait, method.trait. This seems very similar to the example in the vignette, but when I run the code below I keep getting an error: Input must be a vector, not NULL
3 rows of sample data
x <- structure(list(geno_name = c("MB mixed", "MB mixed", "MB mixed"
), study_location = c("lab", "lab", "lab"), author = c("test",
"test", "test"), value.lipids = c(NA, 2.361463603, 1.461189384
), unit.lipids = c(NA, "g cm^-2", "g cm^-2"), method.lipids = c(NA,
"airbrush", "airbrush"), value.density = c(1.125257337, 0.816034359,
0.49559013), unit.density = c("g cm^-3", "g cm^-3", "g cm^-3"
), method.density = c("3D scanning", "3D scanning", "3D scanning"
)), row.names = c(NA, 3L), class = "data.frame")
Current pivot code:
x %>%
select(!c(study_location, author)) %>%
pivot_longer(cols = !geno_name,
names_to = c(".value", "trait"),
names_sep = ".",
values_drop_na = TRUE)
Error code:
Error: Input must be a vector, not NULL. Run rlang::last_error() to
see where the error occurred. In addition: Warning messages: 1: In
gsub(paste0("^", names_prefix), "", names(cols)) : argument
'pattern' has length > 1 and only the first element will be used 2:
Expected 2 pieces. Additional pieces discarded in 6 rows [1, 2, 3, 4,
5, 6].

We can also do
tidyr::pivot_longer(x,
cols = ends_with(c('lipids', 'density')),
names_to = c('.value', 'trait'),
names_sep = '[.]',
values_drop_na = TRUE)

Here's an approach that first makes the data longer, then splits out traits from unit/method, then spreads those.
x %>%
janitor::clean_names() %>% # This makes the column names distinct with #s
pivot_longer(cols = -(1:2),
names_to = "var",
values_to = "val",
values_transform = list(val = as.character)) %>%
mutate(trait = if_else(str_detect(var, "unit|method", negate = TRUE),
var, NA_character_),
# the regex below is meant to remove everything starting with _
stat = if_else(is.na(trait), var %>% str_remove("\\_[^.]*$"), "value")) %>%
fill(trait) %>%
select(-var) %>%
pivot_wider(names_from = stat, values_from = val)
# A tibble: 2 x 6
geno_name observation_id trait value unit method
<chr> <dbl> <chr> <chr> <chr> <chr>
1 MB mixed 10 lipids NA NA NA
2 MB mixed 10 density 1.125 g cm^-3 3D scanning

You can use pivot_longer as follows:
tidyr::pivot_longer(x,
cols = matches('lipids|density'),
names_to = c('.value', 'trait'),
names_sep = '\\.',
values_drop_na = TRUE)
# geno_name study_location author trait value unit method
# <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
#1 MB mixed lab test density 1.13 g cm^-3 3D scanning
#2 MB mixed lab test lipids 2.36 g cm^-2 airbrush
#3 MB mixed lab test density 0.816 g cm^-3 3D scanning
#4 MB mixed lab test lipids 1.46 g cm^-2 airbrush
#5 MB mixed lab test density 0.496 g cm^-3 3D scanning
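The working answers differ from the question's code only in how the dot is written. A single string passed to names_sep is treated as a regular expression, and an unescaped "." matches any character. Here is a minimal base-R sketch of that behaviour, using strsplit() as a stand-in for the splitting step (an illustration only; pivot_longer's internals may differ):

```r
# A bare "." is a regex wildcard, so every character counts as a separator
# and no usable pieces are left.
wildcard <- strsplit("value.lipids", ".")[[1]]
print(wildcard)

# Escaping the dot, wrapping it in a character class, or matching it as a
# fixed string all split on the literal dot instead.
escaped <- strsplit("value.lipids", "\\.")[[1]]
bracket <- strsplit("value.lipids", "[.]")[[1]]
literal <- strsplit("value.lipids", ".", fixed = TRUE)[[1]]
print(escaped)
```

This is why both names_sep = '[.]' and names_sep = '\\.' work where names_sep = "." does not.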

Related

How to combine dplyr group_by, summarise, across and multiple function outputs?

I have the following tibble:
tTest = tibble(Cells = rep(c("C1", "C2", "C3"), times = 3),
Gene = rep(c("G1", "G2", "G3"), each = 3),
Experiment_score = 1:9,
Pattern1 = 1:9,
Pattern2 = -(1:9),
Pattern3 = 9:1) %>%
group_by(Gene)
and I would like to correlate the Experiment_score with each of the Pattern columns within each Gene.
Looking at the tidyverse across page and examples, I thought this would work:
# `corList` is a simple wrapper for `cor` to have exactly two outputs:
corList = function(x, y) {
result = cor.test(x, y)
return(list(stat = result$estimate, pval = result$p.value))
}
tTest %>% summarise(across(starts_with("Pattern"), ~ corList(Experiment_score, .x), .names = "{.col}_corr_{.fn}"))
but I got this:
I have found a solution by melting the Pattern columns and I will post it down below for completeness but the challenge is that I have dozens of Pattern columns and millions of rows. If I melt the Pattern columns, I end up with half a billion rows, seriously hampering my ability to work with the data.
EDIT:
My own imperfect solution:
# `corVect` is a simple wrapper for `cor` to have exactly two outputs:
corVect = function(x, y) {
result = cor.test(x, y)
return(c(stat = result$estimate, pval = result$p.value))
}
tTest %>% pivot_longer(starts_with("Pattern"), names_to = "Pattern", values_to = "Strength") %>%
group_by(Gene, Pattern) %>%
summarise(CorrVal = corVect(Experiment_score, Strength)) %>%
mutate(CorrType = c("corr", "corr_pval")) %>%
# Reformat
pivot_wider(id_cols = c(Gene, Pattern), names_from = CorrType, values_from = CorrVal)
To get the desired result in one step, have the function return a tibble rather than a list, and set .unpack = TRUE in across. Here, using a conveniently named corTibble function:
library(tidyverse)
tTest = tibble(
Cells = rep(c("C1", "C2", "C3"), times = 3),
Gene = rep(c("G1", "G2", "G3"), each = 3),
Experiment_score = 1:9,
Pattern1 = 1:9 + rnorm(9), # added some noise
Pattern2 = -(1:9 + rnorm(9)),
Pattern3 = 9:1 + rnorm(9)
) %>%
group_by(Gene)
corTibble = function(x, y) {
result = cor.test(x, y)
return(tibble(stat = result$estimate, pval = result$p.value))
}
tTest %>% summarise(across(
starts_with("Pattern"),
~ corTibble(Experiment_score, .x),
.names = "{.col}_corr",
.unpack = TRUE
))
#> # A tibble: 3 × 7
#> Gene Pattern1_corr_stat Pattern1_corr_pval Pattern2…¹ Patte…² Patte…³ Patte…⁴
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 G1 0.947 0.208 -0.991 0.0866 -1.00 0.0187
#> 2 G2 0.964 0.172 -0.872 0.325 -0.981 0.126
#> 3 G3 0.995 0.0668 -0.680 0.524 -0.409 0.732
#> # … with abbreviated variable names ¹​Pattern2_corr_stat, ²​Pattern2_corr_pval,
#> # ³​Pattern3_corr_stat, ⁴​Pattern3_corr_pval

How to transform a data frame by using pivot_longer

I have the following data set
library(data.table)
df <- data.table(
id = c(1),
field_a.x = c(10),
field_a.y = c(20),
field_b.x = c(30),
field_b.y = c(40))
And, I'd like to transform it into
df_result <- data.table(
id = c(1),
field_name = c("field_a", "field_b"),
x = c(10, 30),
y = c(20, 40))
by using the pivot_longer function, taking the suffixes ".x" and ".y" into account.
There will be many more fields in my real data, but I would like to see how to handle it with 2 as an example.
Thanks!
You could set ".value" in names_to and supply one of names_sep or names_pattern to specify how the column names should be split.
library(tidyr)
df %>%
pivot_longer(-1, names_to = c("field_name", ".value"), names_sep = "\\.")
# # A tibble: 2 × 4
# id field_name x y
# <dbl> <chr> <dbl> <dbl>
# 1 1 field_a 10 20
# 2 1 field_b 30 40
names_sep = "\\." can be replaced with names_pattern = "(.+)\\.(.+)".
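A minimal sketch checking that the two forms agree on this data. It builds the toy data as a tibble rather than a data.table, and df_dots (with an extra dot inside the field names) is a hypothetical variation added only to show where names_pattern gives finer control:

```r
library(tidyr)

# Toy data mirroring the question, built as a tibble (assumption: the
# data.table class is not essential to the reshaping itself)
df <- tibble::tibble(id = 1, field_a.x = 10, field_a.y = 20,
                     field_b.x = 30, field_b.y = 40)

by_sep <- pivot_longer(df, -id,
                       names_to = c("field_name", ".value"),
                       names_sep = "\\.")
by_pattern <- pivot_longer(df, -id,
                           names_to = c("field_name", ".value"),
                           names_pattern = "(.+)\\.(.+)")
print(all.equal(by_sep, by_pattern))

# Hypothetical variation: field names that themselves contain a dot.
# The greedy first group in names_pattern splits on the last dot only.
df_dots <- tibble::tibble(id = 1, field.a.x = 10, field.a.y = 20,
                          field.b.x = 30, field.b.y = 40)
extra_dots <- pivot_longer(df_dots, -id,
                           names_to = c("field_name", ".value"),
                           names_pattern = "(.+)\\.(.+)")
print(extra_dots)
```

When the separator can also occur inside the part of the name you want to keep, names_pattern is usually the safer choice, because the capture groups state exactly where the split happens.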

Stack data (maybe pivot_longer) but complicated, R

I have data like this:
df<-structure(list(record_id = c(1, 2, 4), alcohol = c(1, 2, 1),
ethnicity = c(1, 1, 1), bilateral_vs_unilateral = c(1, 2,
2), fat_grafting = c(1, 1, 0), number_of_adm_sheets_used = c(1,
NA, NA), number_of_adm_sheets_used_2 = c(1, 1, 1), number_of_fills = c(7,
NA, NA), number_of_fills_2 = c(7, NA, 2), total_fill_volume_ml_left = c(240,
NA, NA), total_volume_ml = c(240, 300, 550), implant_size_l = c(NA_real_,
NA_real_, NA_real_), implant_size_l_2 = c(NA_real_, NA_real_,
NA_real_)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
It is info about patients with each row representing a patient that underwent breast surgery.
I'd like to change it so that each row represents a particular breast (of the two). Several variables, everything from 'number_of_adm_sheets_used' to 'implant_size_l_2', have a column for each side, and I'd like each pair to become a single column. For example, 'number_of_adm_sheets_used' refers to the left side and 'number_of_adm_sheets_used_2' to the right side. I'd like to combine them into one column of sheets used that applies to whichever side the row represents.
My expected output would look like:
[Images: table before ("Pre-") and after ("Post-") the transformation]
I figure it's some variant of pivot_longer, but I'm having trouble with a few aspects:
the real data has 68 columns
I only need a duplicate row if the column "bilateral_vs_unilateral" is a "1" (meaning bilateral)
The way I've used pivot_longer before, you'd say "cols" and pick a big range, I'm not sure how to stack pairs of columns, if that makes sense.
Luckily, although the real data has 68 columns, all of the "trouble" columns are shown here. The pairs are:
'number_of_adm_sheets_used' with 'number_of_adm_sheets_used_2'
'number_of_fills' with 'number_of_fills_2'
'total_fill_volume_ml_left' with 'total_volume_ml'
'implant_size_l' with 'implant_size_l_2'
Thank you
Here is one possibility, if I'm understanding the issue correctly.
library(tidyverse)
# Make long format
df.long <- df %>%
pivot_longer(cols = -record_id) %>%
mutate(subject = ifelse(str_sub(name, -2, -1) == "_2", "breast 2", NA),
# anchor the pattern so only a trailing "_2" is removed
name = str_remove(name, "_2$")) %>%
group_by(record_id, name) %>%
mutate(subject = case_when(
subject == "breast 2" ~ subject,
n() == 2 ~ "breast 1",
n() == 1 ~ "patient"
)) %>%
ungroup()
# statistics regarding the patient
patient <- df.long %>%
filter(subject == "patient") %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-subject)
# statistics regarding each breast
breasts <- df.long %>%
filter(str_detect(subject, "breast")) %>%
pivot_wider(names_from = name, values_from = value)
# merge the two data.frames
patient %>%
inner_join(breasts, by = "record_id") %>%
select(record_id, subject, everything())
If you rename your "trouble columns" to a consistent pattern, then you can use pivot_longer()'s names_pattern argument and ".value" sentinel to pull pairs of values into rows. In my example code, I suffixed these with "_l" or "_r" for left- and right-sided variants. We can use the values_drop_na argument to keep only the valid rows for unilateral cases.
I also changed alcohol to a factor, just to demonstrate that it doesn't throw the error you noted in the bounty.
library(tidyverse)
df_long <- df %>%
mutate(alcohol = factor(alcohol)) %>%
rename(
number_of_adm_sheets_used_l = number_of_adm_sheets_used,
number_of_adm_sheets_used_r = number_of_adm_sheets_used_2,
number_of_fills_l = number_of_fills,
number_of_fills_r = number_of_fills_2,
total_fill_volume_ml_l = total_fill_volume_ml_left,
total_fill_volume_ml_r = total_volume_ml,
implant_size_l = implant_size_l,
implant_size_r = implant_size_l_2
) %>%
pivot_longer(
cols = ends_with(c("_l", "_r")),
names_to = c(".value", "side"),
names_pattern = "(.+)_(l|r)",
values_drop_na = TRUE
)
Output:
### move pivoted columns up front for illustration purposes
df_long %>%
relocate(record_id, side, number_of_adm_sheets_used:implant_size)
# A tibble: 4 x 10
record_id side number_of_adm_sheets_used number_of_fills total_fill_volume~
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 l 1 7 240
2 1 r 1 7 240
3 2 r 1 NA 300
4 4 r 1 2 550
# ... with 5 more variables: implant_size <dbl>, alcohol <fct>,
# ethnicity <dbl>, bilateral_vs_unilateral <dbl>, fat_grafting <dbl>

I want to write a custom function with tidyverse verbs/syntax that accepts the grouping parameters of my function as strings

I want to write a function whose parameters are a data set, a variable to group by, and a value to filter on. I want to write the function in such a way that I can afterwards apply map() to it and pass the grouping variables to map() as a character vector. However, I don't know how to make my custom function rating() accept the grouping variable as a string. This is what I have tried.
data = tibble(a = seq.int(1:10),
g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
na = NA,
stat=c(23,43,53,2,43,18,54,94,43,87))
rating = function(data, by, no){
data %>%
select(a, {{by}}, stat) %>%
group_by({{by}}) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
rating(data = data, by = g2, no = 5) #this works
And this is the way I want to use my function
map(.x = c("g1", "g2"), .f = ~rating(data = data, by = .x, no = 1))
... but i get
Error: Must group by variables found in `.data`.
* Column `.x` is not found.
Since we are passing character strings, it is better to convert them to symbols and unquote them with !!
library(dplyr)
library(purrr)
rating <- function(data, by, no){
by <- rlang::ensym(by)
data %>%
select(a, !! by, stat) %>%
group_by(!!by) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
Testing:
> map(.x = c("g1", "g2"), .f = ~rating(data = data, by = !!.x, no = 1))
[[1]]
# A tibble: 1 × 4
a g1 stat rank
<int> <chr> <dbl> <dbl>
1 1 blue 23 1
[[2]]
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 1 pink 23 1
It also works with unquoted input
> rating(data, by = g2, no = 5)
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 5 hotpink 43 3
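If the function only ever needs to take column names as strings, a different technique from the ensym() approach above is tidyselect's all_of(), which avoids unquoting at the call site. The sketch below assumes that use case; rating_chr is a hypothetical name, and the data is the question's tibble without the unused na column:

```r
library(dplyr)
library(purrr)

data <- tibble(
  a = 1:10,
  g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
  g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
  stat = c(23, 43, 53, 2, 43, 18, 54, 94, 43, 87)
)

# rating_chr() is a hypothetical variant of rating() that accepts `by`
# only as a character string naming the grouping column.
rating_chr <- function(data, by, no) {
  data %>%
    select(a, all_of(by), stat) %>%
    group_by(across(all_of(by))) %>%  # group by the column named in `by`
    mutate(rank = rank(stat)) %>%
    ungroup() %>%
    filter(a == no)
}

# No !! is needed in the lambda because `by` is an ordinary string here
results <- map(c("g1", "g2"), ~ rating_chr(data, by = .x, no = 1))
print(results)
```

The trade-off is that this version no longer accepts unquoted column names such as by = g2, so it suits code that is driven by map() over a character vector.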

dplyr `pivot_longer()` object not found but it's right there?

library(tidyverse)
df <- tibble(Date = as.Date(c("2020-01-01", "2020-01-02")),
Shop = c("Store A", "Store B"),
Employees = c(5, 10),
Sales = c(1000, 3000))
#> # A tibble: 2 x 4
#> Date Shop Employees Sales
#> <date> <chr> <dbl> <dbl>
#> 1 2020-01-01 Store A 5 1000
#> 2 2020-01-02 Store B 10 3000
I'm switching from tidyr's spread/gather to pivot_* following the tidyr reference guide. I want to gather the "Employees" and "Sales" columns in the following manner:
df %>% pivot_longer(-Date, -Shop, names_to = "Names", values_to = "Values")
#> Error in build_longer_spec(data, !!cols, names_to = names_to,
#> values_to = values_to, : object 'Shop' not found
But I'm getting this error. It seems as though I'm doing everything right. Except that I'm apparently not. Do you know what went wrong?
The cols argument is all of the columns you want to pivot. You can think of it as the complement of the id.vars argument from reshape2::melt
df %>% pivot_longer(-c(Date, Shop), names_to = "Names", values_to = "Values")
This is the same as:
reshape2::melt(df, id.vars=c("Date", "Shop"), variable.name="Names", value.name="Values")
I think a clearer syntax would be
df %>%
pivot_longer(cols = c(Employees, Sales))
as opposed to listing the columns you want to exclude. Adding
names_to = "Names", values_to = "Values"
just replaces the default new column names, name and value, with Names and Values.
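A minimal sketch comparing the ways of writing cols discussed above, plus the !c(...) negation form that tidyselect also accepts. The variable names by_exclusion, by_inclusion and by_negation are made up for illustration:

```r
library(tidyr)

df <- tibble::tibble(
  Date = as.Date(c("2020-01-01", "2020-01-02")),
  Shop = c("Store A", "Store B"),
  Employees = c(5, 10),
  Sales = c(1000, 3000)
)

# All exclusions must go into a single `cols` selection
by_exclusion <- pivot_longer(df, -c(Date, Shop),
                             names_to = "Names", values_to = "Values")

# Naming the columns to pivot directly
by_inclusion <- pivot_longer(df, c(Employees, Sales),
                             names_to = "Names", values_to = "Values")

# Negating the id columns with `!`
by_negation <- pivot_longer(df, !c(Date, Shop),
                            names_to = "Names", values_to = "Values")

print(by_exclusion)
```

All three select the same two columns, so the choice is mostly about which reads more clearly for a given data set.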
