I have the following data set:
library(data.table)
df <- data.table(
id = c(1),
field_a.x = c(10),
field_a.y = c(20),
field_b.x = c(30),
field_b.y = c(40))
And, I'd like to transform it into
df_result <- data.table(
id = c(1),
field_name = c("field_a", "field_b"),
x = c(10, 30),
y = c(20, 40))
by using the pivot_longer function, taking the suffixes ".x" and ".y" into account.
My real data has many more fields, but I would like to see how to do it for two of them as an example.
Thanks!
You could set ".value" in names_to and supply one of names_sep or names_pattern to specify how the column names should be split.
library(tidyr)
df %>%
pivot_longer(-1, names_to = c("field_name", ".value"), names_sep = "\\.")
# # A tibble: 2 × 4
# id field_name x y
# <dbl> <chr> <dbl> <dbl>
# 1 1 field_a 10 20
# 2 1 field_b 30 40
names_sep = "\\." can be replaced with names_pattern = "(.+)\\.(.+)".
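To make the equivalence concrete, here is a minimal sketch that runs both variants on the data from the question (rebuilt as a tibble, which is an assumption made only to keep the example self-contained) and compares the results.

```r
library(tidyr)
library(tibble)

# Same toy data as in the question, built as a tibble for this sketch
df <- tibble(
  id = 1,
  field_a.x = 10, field_a.y = 20,
  field_b.x = 30, field_b.y = 40
)

# names_sep splits each column name at the literal dot
by_sep <- pivot_longer(df, -id,
                       names_to = c("field_name", ".value"),
                       names_sep = "\\.")

# names_pattern captures the two pieces with regex groups instead
by_pattern <- pivot_longer(df, -id,
                           names_to = c("field_name", ".value"),
                           names_pattern = "(.+)\\.(.+)")

# Compare the two results
all.equal(by_sep, by_pattern)
```

names_pattern is the more flexible of the two when the field names themselves may contain dots: the first group is greedy, so the split happens at the last dot only.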
I have a data frame where I want to sum column values with the same prefix to produce a new column. My current problem is that it's not taking into account my group_by variable and returning identical values. Is part of the problem the .cols variable I'm selecting in the across function?
Sample data
library(dplyr)
library(purrr)
library(glue) # needed for glue() in the attempt below
set.seed(10)
dat <- data.frame(id = rep(1:2, 5),
var1.pre = rnorm(10),
var1.post = rnorm(10),
var2.pre = rnorm(10),
var2.post = rnorm(10)
) %>%
mutate(index = id)
var_names = c("var1", "var2")
What I've tried
sumfunction <- map(
var_names,
~function(.){
sum(dat[glue("{.x}.pre")], dat[glue("{.x}.post")], na.rm = TRUE)
}
) %>%
setNames(var_names)
dat %>%
group_by(id) %>%
summarise(
across(
.cols = index,
.fns = sumfunction,
.names = "{.fn}"
)
) %>%
ungroup
Desired output
For this and similar problems I made the 'dplyover' package (it is not on CRAN). Here we can use dplyover::across2() to loop over two series of columns, first, all columns ending with "pre" and second all columns ending with "post". To get the names correct we can use .names = "{pre}" to get the common prefix of both series of columns.
library(dplyr)
library(dplyover) # https://timteafan.github.io/dplyover/
dat %>%
group_by(id) %>%
summarise(across2(ends_with("pre"),
ends_with("post"),
~ sum(c(.x, .y)),
.names = "{pre}"
)
)
#> # A tibble: 2 × 3
#> id var1 var2
#> <int> <dbl> <dbl>
#> 1 1 -2.32 -5.55
#> 2 2 1.11 -9.54
Created on 2022-12-14 with reprex v2.0.2
Whenever operations across multiple columns get complicated, we could pivot:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(-c(id, index),
names_to = c(".value", "name"),
names_sep = "\\.") %>%
group_by(id) %>%
summarise(var1 = sum(var1), var2=sum(var2))
id var1 var2
<int> <dbl> <dbl>
1 1 -2.32 -5.55
2 2 1.11 -9.54
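If there are many variables, naming each one inside summarise() gets tedious. A possible variation, assuming every value column starts with "var" as in the sample data, is to let across() pick them up after pivoting:

```r
library(dplyr)
library(tidyr)

# Sample data as in the question (without the extra index column)
set.seed(10)
dat <- data.frame(id = rep(1:2, 5),
                  var1.pre = rnorm(10), var1.post = rnorm(10),
                  var2.pre = rnorm(10), var2.post = rnorm(10))

res <- dat %>%
  # One row per id and time point ("pre"/"post"), one column per variable
  pivot_longer(-id, names_to = c(".value", "time"), names_sep = "\\.") %>%
  group_by(id) %>%
  # Sum every var* column within each id; no variable is named explicitly
  summarise(across(starts_with("var"), ~ sum(.x, na.rm = TRUE)))

res
```

The selection helper starts_with("var") is only an assumption about the naming scheme; any tidyselect expression that matches the value columns would work in its place.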
I want to write a function that takes a data set, a grouping variable, and a value to filter on. I want to write it in such a way that I can afterwards apply map() to it and pass the grouping variables to map() as a character vector. However, I don't know how to make my custom function rating() accept the grouping variable as a string. This is what I have tried:
data = tibble(a = seq.int(1:10),
g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
na = NA,
stat=c(23,43,53,2,43,18,54,94,43,87))
rating = function(data, by, no){
data %>%
select(a, {{by}}, stat) %>%
group_by({{by}}) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
rating(data = data, by = g2, no = 5) # this works
And this is the way I want to use my function:
map(.x = c("g1", "g2"), .f = ~rating(data = data, by = .x, no = 1))
... but I get
Error: Must group by variables found in `.data`.
* Column `.x` is not found.
Because the column names are passed as character strings, it is better to convert them to symbols with rlang::ensym() and unquote them with !! wherever they are used:
library(dplyr)
library(purrr)
rating <- function(data, by, no){
by <- rlang::ensym(by)
data %>%
select(a, !! by, stat) %>%
group_by(!!by) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
-testing
> map(.x = c("g1", "g2"), .f = ~rating(data = data, by = !!.x, no = 1))
[[1]]
# A tibble: 1 × 4
a g1 stat rank
<int> <chr> <dbl> <dbl>
1 1 blue 23 1
[[2]]
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 1 pink 23 1
It also works with unquoted input
> rating(data, by = g2, no = 5)
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 5 hotpink 43 3
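If the function only ever needs to accept strings, another option is to skip the symbol conversion and use tidyselect's all_of(). The sketch below is an alternative, not the approach from the answer above; rating_chr() is a hypothetical name, and the example data omits the unused na column.

```r
library(dplyr)
library(purrr)
library(tibble)

data <- tibble(a = 1:10,
               g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
               g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
               stat = c(23, 43, 53, 2, 43, 18, 54, 94, 43, 87))

# Hypothetical variant: `by` must be a column name given as a string
rating_chr <- function(data, by, no) {
  data %>%
    select(a, all_of(by), stat) %>%
    group_by(across(all_of(by))) %>%  # group by the column named in `by`
    mutate(rank = rank(stat)) %>%
    ungroup() %>%
    filter(a == no)
}

# No !! needed at the call site
res <- map(c("g1", "g2"), ~ rating_chr(data, by = .x, no = 1))
res
```

The trade-off is that rating_chr() no longer accepts an unquoted column name such as g2; the ensym() version handles both.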
library(tidyverse)
df <- tibble(Date = as.Date(c("2020-01-01", "2020-01-02")),
Shop = c("Store A", "Store B"),
Employees = c(5, 10),
Sales = c(1000, 3000))
#> # A tibble: 2 x 4
#> Date Shop Employees Sales
#> <date> <chr> <dbl> <dbl>
#> 1 2020-01-01 Store A 5 1000
#> 2 2020-01-02 Store B 10 3000
I'm switching from tidyr's spread/gather to pivot_*, following the tidyr reference guide. I want to gather the "Employees" and "Sales" columns in the following manner:
df %>% pivot_longer(-Date, -Shop, names_to = "Names", values_to = "Values")
#> Error in build_longer_spec(data, !!cols, names_to = names_to,
#> values_to = values_to, : object 'Shop' not found
But I'm getting this error. It seems as though I'm doing everything right. Except that I'm apparently not. Do you know what went wrong?
The cols argument is all of the columns you want to pivot. You can think of it as the complement of the id.vars argument from reshape2::melt
df %>% pivot_longer(-c(Date, Shop), names_to = "Names", values_to = "Values")
same as:
reshape2::melt(df, id.vars=c("Date", "Shop"), variable.name="Names", value.name="Values")
I think a clearer syntax would be
df %>%
pivot_longer(cols = c(Employees, Sales))
As opposed to writing the columns you want to drop.
names_to = "Names", values_to = "Values"
only renames the new columns, which are called name and value by default.
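A quick way to check this is to compare the column names with and without the two arguments. This is a small sketch on the data from the question:

```r
library(tidyr)
library(tibble)

df <- tibble(Date = as.Date(c("2020-01-01", "2020-01-02")),
             Shop = c("Store A", "Store B"),
             Employees = c(5, 10),
             Sales = c(1000, 3000))

# Without names_to/values_to the new columns get the default names
default_names <- pivot_longer(df, cols = c(Employees, Sales))
names(default_names)
#> [1] "Date"  "Shop"  "name"  "value"

# With them, only the names of the two new columns change
custom_names <- pivot_longer(df, cols = c(Employees, Sales),
                             names_to = "Names", values_to = "Values")
names(custom_names)
#> [1] "Date"   "Shop"   "Names"  "Values"
```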
I am using the output coefficients from a glm regression model and I need to create a lookup key by pasting [column name].[factor level], and then return the corresponding value from another data table. The column names must be dynamic so that I don't have to explicitly name each column one by one.
The values returned from the lookup are then multiplied by 1 (for factors) or by the actual numeric values, and the results for all columns are summed into the column Total.
I've built an example in Excel but cannot replicate it in R.
var_Factor1 combines the column name and the factor level from each row (using paste) to build a key for the next step lookup
var_Number1 is just the column name as it is numeric and has no factor levels
library(data.table)
library(dplyr)
# original data
dt = data.table(
Factor1 = c("A","B","C"),
Number1 = c(10, 20,40),
Factor2 = c("D","H","N"),
Number2 = c(2, 5,3)
)
# Lookup table
model_coef = data.table(
Factor1.A = 10,
Factor1.B = 20,
Factor1.C = 30,
Factor2.D = 40,
Factor2.H = 50,
Factor2.N = 60,
Number1 = 200,
Number2 = 500
)
#initial steps
dt <- dt %>% mutate (
var_Factor1 = paste("Factor1", Factor1, sep =".")
, var_Number1 = "Number1"
, var_Factor2 = paste("Factor2", Factor2, sep =".")
, var_Number2 = "Number2"
) %>% mutate (
coef_Factor1 = model_coef[,var_Factor1]
)
#The final output should produce (as replicated from Excel)
final_output = data.table (
Factor1= c("A", "B", "C"),
Number1= c(10, 20, 40),
Factor2= c("D", "H", "N"),
Number2= c(2, 5, 3),
var_Factor1= c("Factor1.A", "Factor1.B", "Factor1.C"),
var_Number1= c("Number1", "Number1", "Number1"),
var_Factor2= c("Factor2.D", "Factor2.H", "Factor2.N"),
var_Number2= c("Number2", "Number2", "Number2"),
coef_Factor1= c(10, 20, 30),
coef_Number1= c(200, 200, 200),
coef_Factor2= c(40, 50, 60),
coef_Number2= c(500, 500, 500),
calc_Factor1= c(10, 20, 30),
calc_Number1= c(2000, 4000, 8000),
calc_Factor2= c(40, 50, 60),
calc_Number2= c(1000, 2500, 1500),
Total= c(3050, 6570, 9590)
)
It's generally a bad idea to try to generate and manipulate dynamic columns.
It will probably be better to use tidy data conventions and make the data "long". Also, it looks like you're trying to mix data.table and dplyr/tidyverse. In particular, this doesn't work: mutate(coef_Factor1 = model_coef[,var_Factor1])
I've tidied your data and modified your code to use dplyr/tidyverse below:
using tibble instead of data.table
re-built lookup table to tidy-long format so it can be left_joined properly to your table
used mutate to do the calculations that you describe
Beyond your example, if you have more than 2 "Numbers"/"Factors" (your naming/labeling/numbering is confusing btw), there are ways to generalize further so that the code multiplies coef * number generically, for each "number"/combination. Also, your data implies but it isn't clear that A is related to D, B is related to H, etc.
library(tidyverse)
data <- tibble(Factor1 = c("A","B","C"),Number1 = c(10, 20,40),Factor2 = c("D","H","N"),Number2 = c(2, 5,3))
model_coef <- tibble(Factor1.A = 10,Factor1.B = 20,Factor1.C = 30,Factor2.D = 40,Factor2.H = 50,Factor2.N = 60,Number1 = 200,Number2 = 500)
(model_coef_factor1 <- model_coef %>%
select(Factor1.A:Factor1.C) %>%
pivot_longer(cols = everything(), names_to = c("number", "factor"), names_sep = "[.]", values_to = "coef_factor1") %>%
select(-number))
#> # A tibble: 3 x 2
#> factor coef_factor1
#> <chr> <dbl>
#> 1 A 10
#> 2 B 20
#> 3 C 30
(model_coef_factor2 <- model_coef %>%
select(Factor2.D:Factor2.N) %>%
pivot_longer(cols = everything(), names_to = c("number", "factor"), names_sep = "[.]", values_to = "coef_factor2") %>%
select(-number))
#> # A tibble: 3 x 2
#> factor coef_factor2
#> <chr> <dbl>
#> 1 D 40
#> 2 H 50
#> 3 N 60
(final_output <- data %>%
left_join(model_coef_factor1, by = c("Factor1"="factor")) %>%
left_join(model_coef_factor2, by = c("Factor2"="factor")) %>%
mutate(coef_number1 = model_coef$Number1,
coef_number2 = model_coef$Number2,
calc_factor1 = coef_factor1,
calc_number1 = Number1 * coef_number1,
calc_factor2 = coef_factor2,
calc_number2 = Number2 * coef_number2,
total = calc_factor1 + calc_number1 + calc_factor2 + calc_number2) %>%
select(total, everything()))
#> # A tibble: 3 x 13
#> total Factor1 Number1 Factor2 Number2 coef_factor1 coef_factor2
#> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 3050 A 10 D 2 10 40
#> 2 6570 B 20 H 5 20 50
#> 3 9590 C 40 N 3 30 60
#> # ... with 6 more variables: coef_number1 <dbl>, coef_number2 <dbl>,
#> # calc_factor1 <dbl>, calc_number1 <dbl>, calc_factor2 <dbl>,
#> # calc_number2 <dbl>
Created on 2019-10-23 by the reprex package (v0.3.0)
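One possible way to generalize this to any number of factor and numeric columns is sketched below. It assumes that every character column is a factor whose coefficient key is [column name].[level], and that every numeric column has a coefficient stored under its own name. The helper names (coef_long, indexed, fac_long, num_long) are made up for the illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

data <- tibble(Factor1 = c("A", "B", "C"), Number1 = c(10, 20, 40),
               Factor2 = c("D", "H", "N"), Number2 = c(2, 5, 3))
model_coef <- tibble(Factor1.A = 10, Factor1.B = 20, Factor1.C = 30,
                     Factor2.D = 40, Factor2.H = 50, Factor2.N = 60,
                     Number1 = 200, Number2 = 500)

# Lookup table in long form: one row per key
coef_long <- pivot_longer(model_coef, everything(),
                          names_to = "key", values_to = "coef")

# Row id so the long pieces can be summed back per original row
indexed <- mutate(data, row = row_number())

# Factors: key is "column.level", multiplier is 1
fac_long <- indexed %>%
  select(row, where(is.character)) %>%
  pivot_longer(-row, names_to = "variable", values_to = "level") %>%
  transmute(row, key = paste(variable, level, sep = "."), multiplier = 1)

# Numbers: key is the column name, multiplier is the value itself
num_long <- indexed %>%
  select(where(is.numeric)) %>%  # this selection also keeps `row`
  pivot_longer(-row, names_to = "key", values_to = "multiplier")

totals <- bind_rows(fac_long, num_long) %>%
  left_join(coef_long, by = "key") %>%
  group_by(row) %>%
  summarise(Total = sum(coef * multiplier))

totals$Total
#> [1] 3050 6570 9590
```

Because nothing in the pipeline names Factor1, Number1 and so on, adding more columns to data and model_coef should not require code changes, as long as the naming assumptions above hold.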
I did a sum of a column using the code below.
I get the correct number, but it is not formatted properly as a number. I also have a case where I need it formatted as currency. This is the code I've tried:
Result %>%
summarise(Pieces_Mailed = sum(Households, na.rm = TRUE)) %>%
comma_format(digits = 12)
First case: it gave me 520698. How do I get it to return 520,698 instead?
Second case: it gave me 46553549. How do I get it to return $4,655,354 instead?
Thanks.
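comma_format() does not format anything by itself; it returns a formatting function, which then has to be called on the column: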
library(dplyr)
library(scales)
library(tibble)
tibble(col = sample(1e5, 10, replace = FALSE)) %>%
summarise(col = sum(col)) %>%
mutate(col = comma_format(accuracy = 1)(col)) # accuracy = 1 rounds to whole numbers; accuracy = 12 would round to multiples of 12
# A tibble: 1 x 1
# col
# <chr>
#1 481,296
For adding $, we need dollar_format
tibble(col = sample(1e5, 10, replace = FALSE)) %>%
summarise(col = sum(col)) %>%
mutate(col = dollar_format(accuracy = 1)(col)) # accuracy = 1 rounds to whole dollars
# A tibble: 1 x 1
# col
# <chr>
#1 $445,896
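Applied to the two totals from the question, the same pattern looks as follows. This is a minimal sketch on plain numbers; the variable names are made up, and accuracy = 1 is an assumption that whole units are wanted.

```r
library(scales)

pieces  <- 520698    # total from the first case
revenue <- 46553549  # total from the second case

# comma_format()/dollar_format() build a formatter; the second call applies it
formatted_pieces  <- comma_format(accuracy = 1)(pieces)
formatted_revenue <- dollar_format(accuracy = 1)(revenue)

formatted_pieces
#> [1] "520,698"
formatted_revenue
#> [1] "$46,553,549"
```

Both results are character strings, so the formatting should be the last step, after all arithmetic on the numbers is done.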