dplyr `pivot_longer()` object not found but it's right there? - r

library(tidyverse)
df <- tibble(Date = as.Date(c("2020-01-01", "2020-01-02")),
Shop = c("Store A", "Store B"),
Employees = c(5, 10),
Sales = c(1000, 3000))
#> # A tibble: 2 x 4
#> Date Shop Employees Sales
#> <date> <chr> <dbl> <dbl>
#> 1 2020-01-01 Store A 5 1000
#> 2 2020-01-02 Store B 10 3000
I'm switching from tidyr's spread()/gather() to pivot_*() following the tidyr reference guide. I want to gather the "Employees" and "Sales" columns in the following manner:
df %>% pivot_longer(-Date, -Shop, names_to = "Names", values_to = "Values")
#> Error in build_longer_spec(data, !!cols, names_to = names_to,
#> values_to = values_to, : object 'Shop' not found
But I'm getting this error. It seems as though I'm doing everything right. Except that I'm apparently not. Do you know what went wrong?

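The cols argument takes all of the columns you want to pivot, as a single selection. In your call only -Date is matched to cols; -Shop is passed as a separate positional argument, so it is not treated as part of the selection and gets evaluated as an ordinary R expression, hence "object 'Shop' not found". You can think of cols as the complement of the id.vars argument from reshape2::melt: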
df %>% pivot_longer(-c(Date, Shop), names_to = "Names", values_to = "Values")
This is roughly the same as:
reshape2::melt(df, id.vars=c("Date", "Shop"), variable.name="Names", value.name="Values")
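As a quick check of the point about cols being one selection, here is a minimal sketch using the df from the question. It assumes a tidyselect version that supports `!` for complements; the object names by_minus and by_bang are just for illustration.

```r
library(tidyr)

df <- tibble(
  Date = as.Date(c("2020-01-01", "2020-01-02")),
  Shop = c("Store A", "Store B"),
  Employees = c(5, 10),
  Sales = c(1000, 3000)
)

# Both id columns are excluded inside ONE selection passed to cols
by_minus <- pivot_longer(df, -c(Date, Shop), names_to = "Names", values_to = "Values")

# The same complement written with `!`
by_bang <- pivot_longer(df, !c(Date, Shop), names_to = "Names", values_to = "Values")

# Two input rows times two pivoted columns gives four rows
nrow(by_minus)
identical(by_minus, by_bang)
```

Either spelling keeps Date and Shop as id columns; the important part is that they are wrapped in a single c().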

I think a clearer syntax would be
df %>%
pivot_longer(cols = c(Employees, Sales))
as opposed to listing the columns you want to exclude. Note that
names_to = "Names", values_to = "Values"
only renames the new columns from their defaults, name and value.
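To make the point about the defaults concrete, a small sketch with the question's df (the names with_defaults and with_custom are only for illustration):

```r
library(dplyr)
library(tidyr)

df <- tibble(
  Date = as.Date(c("2020-01-01", "2020-01-02")),
  Shop = c("Store A", "Store B"),
  Employees = c(5, 10),
  Sales = c(1000, 3000)
)

# Without names_to / values_to the new columns are called "name" and "value"
with_defaults <- pivot_longer(df, cols = c(Employees, Sales))
names(with_defaults)

# Setting the names up front ...
with_custom <- pivot_longer(df, cols = c(Employees, Sales),
                            names_to = "Names", values_to = "Values")

# ... gives the same table as renaming the defaults afterwards
identical(rename(with_defaults, Names = name, Values = value), with_custom)
```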

Related

Using dplyr::summarise with dplyr::across and purrr::map to sum across columns with the same prefix

I have a data frame where I want to sum column values with the same prefix to produce a new column. My current problem is that it's not taking into account my group_by variable and returning identical values. Is part of the problem the .cols variable I'm selecting in the across function?
Sample data
library(dplyr)
library(purrr)
set.seed(10)
dat <- data.frame(id = rep(1:2, 5),
var1.pre = rnorm(10),
var1.post = rnorm(10),
var2.pre = rnorm(10),
var2.post = rnorm(10)
) %>%
mutate(index = id)
var_names = c("var1", "var2")
What I've tried
sumfunction <- map(
var_names,
~function(.){
sum(dat[glue("{.x}.pre")], dat[glue("{.x}.post")], na.rm = TRUE)
}
) %>%
setNames(var_names)
dat %>%
group_by(id) %>%
summarise(
across(
.cols = index,
.fns = sumfunction,
.names = "{.fn}"
)
) %>%
ungroup
Desired output
For this and similar problems I made the 'dplyover' package (it is not on CRAN). Here we can use dplyover::across2() to loop over two series of columns: first, all columns ending with "pre", and second, all columns ending with "post". To get the names right, we can set .names = "{pre}", which uses the common prefix of both series of columns.
library(dplyr)
library(dplyover) # https://timteafan.github.io/dplyover/
dat %>%
group_by(id) %>%
summarise(across2(ends_with("pre"),
ends_with("post"),
~ sum(c(.x, .y)),
.names = "{pre}"
)
)
#> # A tibble: 2 × 3
#> id var1 var2
#> <int> <dbl> <dbl>
#> 1 1 -2.32 -5.55
#> 2 2 1.11 -9.54
Created on 2022-12-14 with reprex v2.0.2
Whenever operations across multiple columns get complicated, we could pivot:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(-c(id, index),
names_to = c(".value", "name"),
names_sep = "\\.") %>%
group_by(id) %>%
summarise(var1 = sum(var1), var2=sum(var2))
id var1 var2
<int> <dbl> <dbl>
1 1 -2.32 -5.55
2 2 1.11 -9.54
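To see what the ".value" pivot does before the summarise step, here is a small sketch on made-up, deterministic data (toy is not the asker's dat, just an illustration). It also swaps the hard-coded var1/var2 for across(), on the assumption that every column other than id and name should be summed.

```r
library(dplyr)
library(tidyr)

# Toy data with the same naming scheme as the question: <prefix>.<pre|post>
toy <- tibble(
  id = c(1, 2, 1, 2),
  var1.pre = c(1, 2, 3, 4),
  var1.post = c(10, 20, 30, 40),
  var2.pre = c(5, 6, 7, 8),
  var2.post = c(50, 60, 70, 80)
)

# ".value" turns the part before the dot into column names (var1, var2);
# the part after the dot ("pre"/"post") lands in the "name" column
long <- toy %>%
  pivot_longer(-id, names_to = c(".value", "name"), names_sep = "\\.")

# Summing each prefix column per id now adds up pre and post together
res <- long %>%
  group_by(id) %>%
  summarise(across(-name, sum))
res
```

With across(-name, sum) no prefix has to be typed out, which may help when there are many more than two.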

How to transform a data frame by using pivot_longer

I have the following data set
library(data.table)
df <- data.table(
id = c(1),
field_a.x = c(10),
field_a.y = c(20),
field_b.x = c(30),
field_b.y = c(40))
And, I'd like to transform it into
df_result <- data.table(
id = c(1),
field_name = c("field_a", "field_b"),
x = c(10, 30),
y = c(20, 40))
using the pivot_longer function, taking the suffixes ".x" and ".y" into account.
My real data has many more fields, but I'd like to see how to process it for two as an example.
Thanks!
You could set ".value" in names_to and supply one of names_sep or names_pattern to specify how the column names should be split.
library(tidyr)
df %>%
pivot_longer(-1, names_to = c("field_name", ".value"), names_sep = "\\.")
# # A tibble: 2 × 4
# id field_name x y
# <dbl> <chr> <dbl> <dbl>
# 1 1 field_a 10 20
# 2 1 field_b 30 40
names_sep = "\\." can be replaced with names_pattern = "(.+)\\.(.+)".
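A quick sketch comparing the two spellings on the question's data, rebuilt here as a tibble so that only tidyr is needed (that substitution is mine, the question uses data.table):

```r
library(tidyr)

df <- tibble(
  id = 1,
  field_a.x = 10,
  field_a.y = 20,
  field_b.x = 30,
  field_b.y = 40
)

# Split the column names at the literal dot
by_sep <- pivot_longer(df, -id,
                       names_to = c("field_name", ".value"),
                       names_sep = "\\.")

# Same split expressed as two capture groups around the dot
by_pattern <- pivot_longer(df, -id,
                           names_to = c("field_name", ".value"),
                           names_pattern = "(.+)\\.(.+)")

identical(by_sep, by_pattern)
```

If a field name could itself contain a dot, names_pattern is probably the safer choice, because the greedy first group splits only at the last dot.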

How do I pivot columns?

I found this very disorganized data frame in an Excel file. This is just a sample of a bigger dataset, with many jobs.
df <- data.frame(
Job = c("Frequency", "Driver", "Operator"),
Gloves = c("Daily", 1,2),
Aprons = c("Weekly", 2,0)
)
I need it in this format, something I can work with in a database:
df <- data.frame(
Job = c("Driver", "Driver", "Operator", "Operator"),
Frequency= c("Daily", "Weekly", "Daily", "Weekly"),
Item= c("Gloves", "Aprons", "Gloves", "Aprons"),
Quantity= c(1,2,2,0)
)
Any thoughts on how to manipulate the data? I have tried without any luck.
We could use tidyverse methods by doing this in three steps:
1. Remove the first row with slice(-1), then reshape to 'long' format with pivot_longer.
2. Keep only the first row with slice(1), then reshape to 'long' format with pivot_longer.
3. Join the two reshaped datasets.
library(dplyr)
library(tidyr)
df %>%
slice(-1) %>%
pivot_longer(cols = -Job, names_to = 'Item',
values_to = 'Quantity') %>%
left_join(df %>%
slice(1) %>%
pivot_longer(cols= -Job, values_to = 'Frequency',
names_to = 'Item') %>%
select(-Job) )
Output:
# A tibble: 4 x 4
Job Item Quantity Frequency
<chr> <chr> <chr> <chr>
1 Driver Gloves 1 Daily
2 Driver Aprons 2 Weekly
3 Operator Gloves 2 Daily
4 Operator Aprons 0 Weekly
Data:
df <- data.frame(
Job = c("Frequency", "Driver", "Operator"),
Gloves = c("Daily", 1,2),
Aprons = c("Weekly", 2,0))
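Note that Quantity comes out as character in the output above, because each original column mixes text ("Daily") with numbers. If a numeric column is wanted, as in the desired result, one possible variant is sketched below; the intermediate names quantities and frequencies are just for illustration, and the explicit by = "Item" is my addition to make the join key visible.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Job = c("Frequency", "Driver", "Operator"),
  Gloves = c("Daily", 1, 2),
  Aprons = c("Weekly", 2, 0)
)

# Rows 2..n hold the quantities; convert them once the text row is gone
quantities <- df %>%
  slice(-1) %>%
  pivot_longer(cols = -Job, names_to = "Item", values_to = "Quantity") %>%
  mutate(Quantity = as.numeric(Quantity))

# Row 1 holds the frequency that belongs to each item
frequencies <- df %>%
  slice(1) %>%
  pivot_longer(cols = -Job, names_to = "Item", values_to = "Frequency") %>%
  select(-Job)

# Attach the frequency to every job/item pair
result <- left_join(quantities, frequencies, by = "Item")
result
```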

Is there a more efficient way to handle facts which are duplicating in an R dataframe?

I have a dataframe which looks like this:
ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
Overall_Category <- c("Purchaser","Purchaser","Purchaser","Car","Car","Car","Car","Car","Car","Car","Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type", "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual", "Blue","No", "Van", "Automatic")
df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)
The data frame's dimensions work like this:
There will always be an ID/key that uniquely identifies a submitted fact.
There will always be a dimension defining the Overall_Category to which a submitted fact belongs.
Most of the time, but not always, there will be a "Descriptor" dimension.
If there is a "Descriptor" dimension for a given fact, there will also be a "Members" dimension showing the possible members within "Descriptor".
The problem is that a single submitted fact is duplicated for a given ID based on how many dimensions apply to the given fact. What I'd like is a way to show the fact only once, based on its ID, and have the applicable dimensions stored against that single ID.
I've achieved it by doing this:
df1 <- pivot_wider(df,
id_cols = ID,
names_from = c(Overall_Category, Descriptor, Members),
names_prefix = "zzzz",
values_from = Fact,
names_sep = "-",
names_repair = "unique")
ColumnNames <- df1 %>% select(matches("zzzz")) %>% colnames()
df2 <- df1 %>% mutate(mean_sel = rowMeans(select(., ColumnNames), na.rm = T))
df3 <- df2 %>% mutate_at(ColumnNames, function(x) ifelse(!is.na(x), deparse(substitute(x)), NA))
df3 <- df3 %>% unite('Descriptor', ColumnNames, na.rm = T, sep = "_")
df3 <- df3 %>% mutate_at("Descriptor", str_replace_all, "zzzz", "")
But it seems like it wouldn't scale well for facts with many dimensions because of the pivot_wider step, and in general it doesn't seem like a very efficient approach.
Is there a better way to do this?
You can unite the columns, then for each ID paste them together and take the average of the Fact values.
library(dplyr)
library(tidyr)
df %>%
unite(Descriptor, Overall_Category:Members, sep = '-', na.rm = TRUE) %>%
group_by(ID) %>%
summarise(Descriptor = paste0(Descriptor, collapse = '_'),
mean_sel = mean(Fact, na.rm = TRUE))
# ID Descriptor mean_sel
# <dbl> <chr> <dbl>
#1 1 Purchaser-Country-America_Purchaser-Gender-Male_Purchas… 233
#2 2 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Trans… 50
#3 3 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmi… 15
I think you want a simple paste with the sep and collapse arguments:
library(dplyr, warn.conflicts = FALSE)
df %>% group_by(ID, Fact) %>%
summarise(Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = '-'), collapse = '_'), .groups = 'drop')
# A tibble: 3 x 3
ID Fact Descriptor
<dbl> <dbl> <chr>
1 1 233 Purchaser-Country-America_Purchaser-Gender-Male_Purchaser-Eyes-Brown
2 2 50 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Transmission-Manual
3 3 15 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmission-Automatic
An option with str_c
library(dplyr)
library(stringr)
df %>%
group_by(ID, Fact) %>%
summarise(Descriptor = str_c(Overall_Category, Descriptor, Members, sep= "-", collapse="_"), .groups = 'drop')
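Since the question says a fact will not always have a "Descriptor" (and therefore "Members"), it may be worth checking how each of the three approaches treats missing values. A minimal sketch, assuming such a fact shows up with NA in those columns (the one-row tibble x and the column name Combined are made up for illustration):

```r
library(tidyr)
library(stringr)

# A hypothetical fact that has a category but no Descriptor/Members
x <- tibble(
  ID = 1,
  Overall_Category = "Purchaser",
  Descriptor = NA_character_,
  Members = NA_character_
)

# paste() turns NA into the literal text "NA"
pasted <- paste(x$Overall_Category, x$Descriptor, x$Members, sep = "-")

# str_c() propagates the missing value: the whole result is NA
joined <- str_c(x$Overall_Category, x$Descriptor, x$Members, sep = "-")

# unite(na.rm = TRUE) drops the missing pieces before joining
united <- unite(x, "Combined", Overall_Category:Members, sep = "-", na.rm = TRUE)

pasted
joined
united$Combined
```

Which behaviour is right depends on how such facts should appear in the final Descriptor string.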

Pivot data from wide to long using column suffix to get table with multiple columns with values (using pivot_longer)

I have a tibble/dataframe that looks like this:
hc_inpatient_sum hc_ambulant_sum hc_inpatient_mean hc_ambulant_mean
5 2 5.5 2.2
My desired output is:
my_names sum mean
hc_inpatient 5 5.5
hc_ambulant 2 2.2
I get what I want using the following code. However, it seems pretty complicated. I guess that the same result could be obtained using less complicated code.
library(dplyr)
library(tidyr)
my_data <- tibble(hc_inpatient_sum = 5, hc_ambulant_sum = 2, hc_inpatient_mean = 5.5,
hc_ambulant_mean = 2.2)
res <- my_data %>%
pivot_longer(cols = everything(), names_to = "my_names", values_to = "my_values") %>%
separate(my_names, into = c("my_names", "stats"), sep = "_(?=[^_]+$)") %>%
pivot_wider(names_from = "stats", values_from = "my_values")
Is there a more direct way to get the same result using tidyr::pivot_longer?
Alternatively I could do something like this...
res2 <- pivot_longer(my_data, cols = everything(),
names_to = c(".value", "stats"),
names_pattern = "(.*)_(.*)") %>%
t()
colnames(res2) <- res2["stats",]
res2 <- as_tibble(res2[-1,], rownames = "my_names") %>%
mutate_at(vars(-my_names), as.double)
... but that is even more awkward.
You can do this in one step with:
my_data %>% pivot_longer(everything(),
names_to = c("my_names", ".value"),
names_pattern = "(.+)_(.+$)")
my_names sum mean
<chr> <dbl> <dbl>
1 hc_inpatient 5 5.5
2 hc_ambulant 2 2.2
These examples are quite helpful for getting the hang of pivot_longer https://tidyr.tidyverse.org/reference/pivot_longer.html
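To see why that pattern splits the names at the last underscore, here is a small sketch that first applies the same regular expression with base R's sub() and then runs the pivot on my_data from the question:

```r
library(tidyr)

my_data <- tibble(
  hc_inpatient_sum = 5,
  hc_ambulant_sum = 2,
  hc_inpatient_mean = 5.5,
  hc_ambulant_mean = 2.2
)

pattern <- "(.+)_(.+$)"

# The first group is greedy, so it takes everything up to the LAST underscore
prefixes <- sub(pattern, "\\1", names(my_data))
# The second group is whatever follows that underscore
suffixes <- sub(pattern, "\\2", names(my_data))

# The prefixes become values of my_names; the suffixes become columns via ".value"
res <- pivot_longer(my_data, everything(),
                    names_to = c("my_names", ".value"),
                    names_pattern = pattern)
res
```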
