How to reshape a data table

I am using R and have the following table (example):
ID Euros N Euros N Euros N
1 A 133.911,20 451 134.208,78 450 442,03 328
2 C 9.470,35 2856 26,18 2721 26,28 2699
My desired result has Euros on one row and N on another row, instead of in columns:
ID Var1 Var2 Var3 Var4
1 A Euros 133.911,20 134.208,78 442,03
2 A N 451 450 328
3 C Euros 9.470,35 26,18 26,28
4 C N 2856 2721 2699
I tried this for group A only, using the following code:
mydatatable_wide <- spread(mydatatable, Euros, N)
But I don't get the expected result. What I get is:
ID 133.911,20 134.208,78 442,03
1 A 451 450 328

It takes some work to get what you want. I am using dplyr and tidyr:
library(dplyr)
library(tidyr)
# Here is a tribble built from your question
# Note: in my locale "." is the decimal point and "," is the thousands separator
# R code does not use a thousands separator.
df <- tribble(
~ID, ~Euros, ~N, ~Euros, ~N, ~Euros, ~N,
"A", 133911.20, 451, 134208.78, 450, 442.03, 328,
"C", 9470.35, 2856, 26.18, 2721, 26.28, 2699)
df %>%
  # first convert the data set into a long version with multiple rows per ID;
  # it contains all the numerical values (Euros & N)
  pivot_longer(cols = where(is.numeric), names_to = "var", values_to = "value") %>%
  # then split it into one group per variable (Euros, N) using group_by & group_map
  group_by(var) %>%
  group_map(~ {
    .x %>%
      group_by(ID) %>%
      # within each ID, create an index variable for the values
      mutate(index_name = paste0("var_", seq(1, n(), by = 1))) %>%
      # then pivot wider to have one row per ID & variable (Euros/N)
      pivot_wider(names_from = "index_name", values_from = value, values_fill = NA)
  }, .keep = TRUE) %>%
  # finally combine all the data frames from group_map into one data frame
  bind_rows()
Output
ID var var_1 var_2 var_3
<chr> <chr> <dbl> <dbl> <dbl>
1 A Euros 133911. 134209. 442.
2 C Euros 9470. 26.2 26.3
3 A N 451 450 328
4 C N 2856 2721 2699
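If the repeated columns can be given unique names first, the same reshape can be written more directly. The following is only a sketch under that assumption: the suffixed names (Euros_1, N_1, ...) and the object names long and wide are hypothetical and do not appear in the question.

```r
library(tidyr)

# Assumption: the repeated Euros/N columns were renamed with a numeric suffix
df <- tibble::tribble(
  ~ID, ~Euros_1, ~N_1, ~Euros_2, ~N_2, ~Euros_3, ~N_3,
  "A", 133911.20, 451, 134208.78, 450, 442.03, 328,
  "C", 9470.35, 2856, 26.18, 2721, 26.28, 2699
)

# Split each column name at "_" into the variable (Euros/N) and its index
long <- pivot_longer(
  df,
  cols = -ID,
  names_to = c("var", "index"),
  names_sep = "_",
  values_to = "value"
)

# Spread the index back out, giving one row per ID and variable
wide <- pivot_wider(
  long,
  names_from = index,
  values_from = value,
  names_prefix = "var_"
)
wide
```

This avoids group_map() because the index is carried in the column names instead of being computed per group.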

Related

Pivot wider dataframe with difficult structure dplyr

I was working on something I thought would be simple, but maybe today my brain isn't working. My data is like this:
df <- tibble(metric = c('income', 'income_low', 'income_upp', 'n', 'n_low', 'n_upp'),
             value = c(120, 100, 140, 10, 8, 12))
metric value
income 120
income_low 100
income_upp 140
n 10
n_low 8
n_upp 12
And I want to pivot_wider so it looks like this:
metric value value_low value_upp
income 120 100 140
n 10 8 12
I'm having trouble separating the metrics, because pivot_wider used as is returns a data frame that's too wide:
df %>% pivot_wider(names_from = 'metric', values_from = value)
How can I achieve this or should I pivot longer after the pivot wider?
Thanks!

I think if you convert metric into a column holding the values "value", "value_upp" and "value_low", you can then use pivot_wider:
library(dplyr)
library(tidyr)
library(stringr) # for str_detect() and str_remove()
df %>%
  mutate(param = case_when(str_detect(metric, "upp") ~ "value_upp",
                           str_detect(metric, "low") ~ "value_low",
                           TRUE ~ "value"),
         metric = str_remove(metric, "_low|_upp")) %>%
  pivot_wider(names_from = param, values_from = value)

I like to use separate() when I have text in a column like this. This function splits a column into multiple columns wherever its values contain a separator.
In particular in this example we would want to use the arguments sep="_" and into = c("metric", "state") to convert into columns with those names.
Then mutate() and pivot_wider() can be used as you had previously specified.
library(tidyverse)
df <- tribble(~metric, ~value,
"income", 120,
"income_low", 100,
"income_upp", 140,
"n", 10,
"n_low", 8,
"n_upp", 12)
df |>
separate(metric, sep = "_", into = c("metric", "state")) |>
mutate(state = ifelse(is.na(state), "value", state)) |>
pivot_wider(id_cols = metric, names_from = state, values_from = value, names_sep = "_")
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 4].
#> # A tibble: 2 × 4
#> metric value low upp
#> <chr> <dbl> <dbl> <dbl>
#> 1 income 120 100 140
#> 2 n 10 8 12
Created on 2022-12-21 with reprex v2.0.2
Note you can use the argument names_glue or names_prefix in pivot_wider() to add the "value" as a prefix to the column names.
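As a possible variation on the separate() approach (a sketch, not the answerer's code), passing fill = "right" to separate() avoids the missing-pieces warning, and building the final column names inside mutate() yields value, value_low and value_upp directly. The object name wide is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- tribble(~metric, ~value,
              "income", 120,
              "income_low", 100,
              "income_upp", 140,
              "n", 10,
              "n_low", 8,
              "n_upp", 12)

wide <- df %>%
  # fill = "right" pads rows without a suffix with NA, without a warning
  separate(metric, into = c("metric", "state"), sep = "_", fill = "right") %>%
  # build the target column name: "value", "value_low" or "value_upp"
  mutate(state = ifelse(is.na(state), "value", paste0("value_", state))) %>%
  pivot_wider(names_from = state, values_from = value)
wide
```

Doing the renaming before the pivot keeps the pivot_wider() call itself free of naming arguments.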

A data.table approach (if you can live with the trailing underscore after value_):
library(data.table)
setDT(df)
# create some new columns based on metric
df[, c("first", "second") := tstrsplit(metric, "_")]
# metric value first second
# 1: income 120 income <NA>
# 2: income_low 100 income low
# 3: income_upp 140 income upp
# 4: n 10 n <NA>
# 5: n_low 8 n low
# 6: n_upp 12 n upp
# replace NA with ""
df[is.na(df)] <- ""
# now cast to wide, creating column names on the fly
dcast(df, first ~ paste0("value_", second), value.var = "value")
# first value_ value_low value_upp
# 1: income 120 100 140
# 2: n 10 8 12

Convert column names into row values and find sum

I have this data frame:
input_output <- data.frame(ip_op = c('input_0', 'input_2', 'input_9', 'output_1', 'output_2', 'output_3'), a = c(1,32, 12, 246, 901, 837), b = c(284, 23, 19, 284, 9, 12), c = c(12, 8940, 379, 490, 0, 12))
ip_op a b c
1 input_0 1 284 12
2 input_2 32 23 8940
3 input_9 12 19 379
4 output_1 246 284 490
5 output_2 901 9 0
6 output_3 837 12 12
I want to create the following data frame:
input_output
type input output
1 a 45 1984
2 b 326 305
3 c 9331 502
I have tried using transpose but the column names become rownames. How do I transform this data frame?

Does this work:
> library(dplyr)
> library(tidyr)
> library(stringr) # for str_extract()
> input_output %>% pivot_longer(-ip_op) %>% mutate(ip_op = str_extract(ip_op, 'input|output')) %>% group_by(ip_op, name) %>% summarise(value = sum(value)) %>%
+ pivot_wider(names_from = ip_op, values_from = value) %>% rename(type = name)
`summarise()` regrouping output by 'ip_op' (override with `.groups` argument)
# A tibble: 3 x 3
type input output
<chr> <dbl> <dbl>
1 a 45 1984
2 b 326 305
3 c 9331 502
>

Try reshaping the data to long format, then separate the variable name to keep the desired prefix. Aggregate the values by the respective groups and then reshape to wide. Here is the code using tidyverse functions:
library(tidyverse)
#Code
new <- input_output %>%
  pivot_longer(-1) %>%
  separate(ip_op, c('Var1', 'Var2'), sep = '_') %>%
  select(-Var2) %>%
  group_by(Var1, name) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  pivot_wider(names_from = Var1, values_from = value)
Output:
# A tibble: 3 x 3
name input output
<chr> <dbl> <dbl>
1 a 45 1984
2 b 326 305
3 c 9331 502

Here is a dplyr and tidyr solution.
library(dplyr)
library(tidyr)
input_output %>%
mutate(ip_op = sub("_.*$", "", ip_op)) %>%
group_by(ip_op) %>%
summarise(across(a:c, ~sum(.x, na.rm = TRUE)), .groups = "keep") %>%
pivot_longer(
cols = a:c,
names_to = 'type',
values_to = 'value'
) %>%
pivot_wider(
id_cols = type,
names_from = ip_op,
values_from = value
)
## A tibble: 3 x 3
# type input output
# <chr> <dbl> <dbl>
#1 a 45 1984
#2 b 326 305
#3 c 9331 502

Here is a base R option using aggregate + reshape like below
u <- aggregate(
. ~ ip_op,
transform(
input_output,
ip_op = gsub("_\\d+", "", ip_op)
),
sum
)
reshape(
cbind(stack(u[-1]), type = u$ip_op),
direction = "wide",
idvar = "ind",
timevar = "type"
)
which gives
ind values.input values.output
1 a 45 1984
3 b 326 305
5 c 9331 502

We can use data.table
library(data.table)
dcast(melt(setDT(input_output), id.var = 'ip_op',
variable.name = 'type')[, ip_op := sub("_.*", "", ip_op)],
type ~ ip_op, value.var = 'value', sum)
-output
# type input output
#1: a 45 1984
#2: b 326 305
#3: c 9331 502
Or using transpose
data.table::transpose(setDT(input_output)[, ip_op := sub("_\\d+$", "", ip_op)][,
lapply(.SD, sum), ip_op], make.names = 'ip_op', keep.names = 'type')
-output
# type input output
#1: a 45 1984
#2: b 326 305
#3: c 9331 502
Or with tidyverse and data.table::transpose
library(dplyr)
library(stringr)
input_output %>%
group_by(ip_op = str_remove(ip_op, '_\\d+$')) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = 'drop') %>%
data.table::transpose(make.names = 'ip_op', keep.names = 'type')
# type input output
#1 a 45 1984
#2 b 326 305
#3 c 9331 502
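Since the question mentions that transposing turns the column names into row names, here is a small base R sketch of that route which moves the row names back into a type column. It assumes the input_output data frame from the question; the names grp, sums, m and res are only illustrative.

```r
input_output <- data.frame(
  ip_op = c('input_0', 'input_2', 'input_9', 'output_1', 'output_2', 'output_3'),
  a = c(1, 32, 12, 246, 901, 837),
  b = c(284, 23, 19, 284, 9, 12),
  c = c(12, 8940, 379, 490, 0, 12)
)

# Strip the numeric suffix so the rows fall into "input" and "output"
grp <- sub("_.*$", "", input_output$ip_op)

# Column sums per group: one row per group, one column per variable
sums <- rowsum(input_output[-1], grp)

# Transpose, then turn the row names (a, b, c) into a regular column
m <- t(sums)
res <- data.frame(type = rownames(m), m, row.names = NULL)
res
```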

Compute sum and relative proportion by group for any number of columns with random names using dplyr

I want to calculate the relative proportion by group for every column of a data frame, except the grouping column. However, this should be programmed once and then reused with different data frames, which will have a different number of columns with different names. Because I am relying heavily on dplyr in this project, I want to achieve this with dplyr.
I have read this topic, regarding a similar but less complex problem:
Use dynamic variable names in `dplyr`
and also vignette("programming", "dplyr"), but I am still not able to set up the quotation correctly. I am really stuck at this point and would like some advice from more experienced developers.
To reproduce the problem, I have set up a minimal example with a data frame with randomly created data columns and a grouping column.
library(dplyr)
library(stringi)
df <- setNames(as.data.frame(matrix(sample(1:10, 999, replace = T), 333, 3)),
stri_rand_strings(3, 10, pattern = "[A-Za-z]"))
group <- c("group1","group2","group3")
df <- cbind(df, group)
The following function should achieve two things:
1) calculate the sum of every column in the data frame by group
2) calculate the relative proportions of every column in the data frame by group
propsum <- function(df, expr){
expr_quo <- enquo(expr)
sum <- paste(quo_name(expr), "sum", sep = ".")
prop <- paste(quo_name(expr), "prop", sep = ".")
df %>%
group_by(., group) %>%
mutate(., !! sum := sum(!! expr_quo),
!! prop := expr / !! sum * 100) -> df
return(df)
}
for(i in length(df)-1){
propsum(df, names(df)[i]) -> df_new
}
The expected result is a data frame with the initial columns, the sums by group for every initial column and the relative proportions for every initial column by group. So in the example, the data frame should have 10 columns (1 grouping column, 3 initial data columns, 3 columns with sums by group, 3 columns with relative proportions by group).
However, I am getting the following error:
Error in sum(~names(df)[i]) : invalid 'type' (character) of argument
In the vignette, the code example for a similar task is:
my_mutate <- function(df, expr) {
expr <- enquo(expr)
mean_name <- paste0("mean_", quo_name(expr))
sum_name <- paste0("sum_", quo_name(expr))
mutate(df,
!! mean_name := mean(!! expr),
!! sum_name := sum(!! expr)
)
}
my_mutate(df, a)
#> # A tibble: 5 x 6
#> g1 g2 a b mean_a sum_a
#> <dbl> <dbl> <int> <int> <dbl> <int>
#> 1 1 1 5 4 3 15
#> 2 1 2 3 2 3 15
#> 3 2 1 4 1 3 15
#> 4 2 2 1 3 3 15
#> # … with 1 more row
I tried a lot of different things as of now, but I am not able to get the RHS to use the correct column. What am I doing wrong?

I have found a solution which I just want to share in case somebody faces a similar task.
The solution is to call rlang::parse_expr() explicitly to turn the variable names into expressions.
Here is the working example:
library(dplyr)
library(stringi)
df <- setNames(as.data.frame(matrix(sample(1:10, 999, replace = T), 333, 3)),
stri_rand_strings(3, 10, pattern = "[A-Za-z]"))
group <- c("group1","group2","group3")
df <- cbind(df, group)
gpercentage <- function(df, a_var, p_var, sum_var){
df %>%
group_by(., group) %>%
mutate(., !! sum_var := sum(!! a_var),
!! p_var := !! a_var / sum(!! a_var)) -> df
return(df)
}
i <- 1
for(i in seq_along(1:(length(df)-1))){
a_var <- rlang::parse_expr(names(df)[i])
p_var <- rlang::parse_expr(paste(names(df)[i], "P", sep = "."))
sum_var <- rlang::parse_expr(paste(names(df)[i], "SUM", sep = "."))
df %>%
gpercentage(., a_var, p_var, sum_var) -> df
}
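For comparison, the same sums and proportions can probably be obtained without a loop by using across() inside a grouped mutate(). This is only a sketch, assuming dplyr 1.0 or later; the small data frame, the column names x and y, and the result name res are made up for illustration, and the proportion is scaled to percent as in the question.

```r
library(dplyr)

set.seed(1)
# Hypothetical stand-in for the randomly named columns in the question
df <- data.frame(
  x = sample(1:10, 9, replace = TRUE),
  y = sample(1:10, 9, replace = TRUE),
  group = rep(c("group1", "group2", "group3"), 3)
)

res <- df %>%
  group_by(group) %>%
  # everything() skips the grouping column inside a grouped mutate()
  mutate(across(
    everything(),
    list(sum = ~ sum(.x), prop = ~ .x / sum(.x) * 100),
    .names = "{.col}.{.fn}"
  )) %>%
  ungroup()
res
```

Because the columns are selected with everything(), the call does not depend on how many data columns there are or what they are called.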

We could achieve this as follows:
propsum <- function(df, grouping_column){
df %>%
group_by(!!sym(grouping_column)) %>%
summarise_all(list(sum,function(x)
length(x)/nrow(.) * 100)) %>%
tidyr::pivot_longer(cols=-1,
names_to = "Variable",
values_to = "Value") %>%
mutate(Variable = gsub("fn1","sum",Variable),
Variable = gsub("fn2","prop",Variable))
}
propsum(iris,"Species")
Using df in the question:
propsum(df,"group")
# A tibble: 18 x 3
group Variable Value
<fct> <chr> <dbl>
1 group1 dVFQteFGjs_sum 628
2 group1 wiQCPUeIvC_sum 599
3 group1 yBvktNXcfd_sum 644
4 group1 dVFQteFGjs_prop 33.3
5 group1 wiQCPUeIvC_prop 33.3
6 group1 yBvktNXcfd_prop 33.3
7 group2 dVFQteFGjs_sum 630
8 group2 wiQCPUeIvC_sum 606
9 group2 yBvktNXcfd_sum 656
10 group2 dVFQteFGjs_prop 33.3
11 group2 wiQCPUeIvC_prop 33.3
12 group2 yBvktNXcfd_prop 33.3
13 group3 dVFQteFGjs_sum 636
14 group3 wiQCPUeIvC_sum 581
15 group3 yBvktNXcfd_sum 635
16 group3 dVFQteFGjs_prop 33.3
17 group3 wiQCPUeIvC_prop 33.3
18 group3 yBvktNXcfd_prop 33.3
To get back to wide format (pivot_wider also works; I find spread "faster" to use):
propsum(df,"group") %>%
tidyr::spread(Variable,Value)
# A tibble: 3 x 7
group dVFQteFGjs_prop dVFQteFGjs_sum wiQCPUeIvC_prop wiQCPUeIvC_sum
<fct> <dbl> <dbl> <dbl> <dbl>
1 grou~ 33.3 628 33.3 599
2 grou~ 33.3 630 33.3 606
3 grou~ 33.3 636 33.3 581
# ... with 2 more variables: yBvktNXcfd_prop <dbl>,
# yBvktNXcfd_sum <dbl>

R - Group by dplyr, and remove duplicates only if ALL members in group are duplicated

I have a large data frame with many duplicates in a single column. I am trying to parse the data frame so that only one entry per duplicate remains, UNLESS all entries are duplicates. (Couldn't find any stackoverflow answers that helped with the second part...)
Example df code:
mydf <- data.frame(accession=c("A", "A", "A", "A", "B", "B", "C", "C", "D"), gene=c("unknown", "red1", "red2", "blue", "green1", "green2", "unknown", "unknown2", "violet"), ident=c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86))
df looks like this:
accession gene ident
1 A unknown 100.0
2 A red1 95.3
3 A red2 80.2
4 A blue 65.1
5 B green1 94.2
6 B green2 100.0
7 C unknown 97.1
8 C unknown2 90.0
9 D violet 86.0
And my desired output table is this:
accession gene ident
2 A red1 95.3
6 B green2 100.0
7 C unknown 97.1
8 C unknown2 90.0
Where only one unique value for accession is kept, based on having a "known" gene with the highest ident, UNLESS all duplicated entries for a particular accession contain the string unknown*.
I'm getting stuck at the last part -- keeping all rows for a duplicated accession if gene contains unknown*. This is what I have so far:
library(dplyr)
mydf$dup <- duplicated(mydf$accession, fromLast = FALSE)|duplicated(mydf$accession, fromLast = TRUE)
mydf <- mydf %>% group_by(accession) %>% mutate(count=n())
mydf <- subset.data.frame(mydf, mydf$dup == TRUE)
mydf <- mydf %>% group_by(accession) %>% filter(!grepl("unknown", gene)) %>% top_n(1,ident)
which gives:
accession gene ident dup count
2 A red1 95.3 TRUE 4
6 B green2 100.0 TRUE 2
My instinct is to do an if statement:
mydf <- mydf %>% group_by(accession) %>%
if(count(grepl("unknown", mydf$gene))!= mydf$count)
{filter(!grepl("unknown", gene))}
%>% top_n(1, ident)
but I'm running into an error:
Error in if (.) count(grepl("unknown", mydf$gene)) != mydf$count else
{ : argument is not interpretable as logical In addition: Warning
message: In if (.) count(grepl("unknown", mydf$gene)) != mydf$count
else { : the condition has length > 1 and only the first element
will be used
What's the correct solution? I'm not married to dplyr if there's a better way! Thanks!

Another option:
1) first arrange the data frame so that unknown genes are sorted to the end and ident is sorted in descending order;
2) then filter per group: require that the group has more than one row, and keep either every row (when the first gene starts with unknown, which means the whole group is unknown, since unknown has been sorted to the end) or only the first row:
mydf %>%
group_by(accession) %>%
arrange(startsWith(gene, 'unknown'), desc(ident)) %>%
filter(n() > 1 & (startsWith(first(gene), 'unknown') | row_number() == 1))
# A tibble: 4 x 3
# Groups: accession [3]
# accession gene ident
# <chr> <chr> <dbl>
#1 B green2 100.0
#2 A red1 95.3
#3 C unknown 97.1
#4 C unknown2 90.0

You could try this:
mydf %>%
group_by(accession) %>%
mutate(n = n()) %>%
filter(n > 1) %>%
mutate(ident_rnk = min_rank(ident),
ident_rnk = if_else(grepl("unknown",gene),-1L,ident_rnk)) %>%
top_n(n = 1,wt = ident_rnk) %>%
select(accession,gene,ident)
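The rule from the question can also be spelled out directly in a grouped filter(). This is a sketch assuming the mydf from the question; the helper column unknown and the result name res are only illustrative.

```r
library(dplyr)

mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2",
           "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86)
)

res <- mydf %>%
  group_by(accession) %>%
  # keep only accessions that occur more than once
  filter(n() > 1) %>%
  mutate(unknown = grepl("unknown", gene)) %>%
  # all-unknown groups are kept whole; otherwise keep the known gene
  # with the highest ident
  filter(
    if (all(unknown)) rep(TRUE, n())
    else !unknown & ident == max(ident[!unknown])
  ) %>%
  ungroup() %>%
  select(-unknown)
res
```

Note that ties on the highest ident within a group would all be kept, which is also how top_n() behaves.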

Dplyr rowwise not working on unnamed position identifiers

I'm trying to get the minimum time for each row in a dataframe. I don't know the names of the columns that I will be choosing, but I do know they will be the first to fifth columns:
data <- structure(list(Sch1 = c(99, 1903, 367),
Sch2 = c(292,248, 446),
Sch3 = c(252, 267, 465),
Sch4 = c(859, 146,360),
Sch5 = c(360, 36, 243),
Student.ID = c("Ben", "Bob", "Ali")),
.Names = c("Sch1", "Sch2", "Sch3", "Sch4", "Sch5", "Student.ID"), row.names = c(NA, 3L), class = "data.frame")
# this gets overall min for ALL rows
data %>% rowwise() %>% mutate(min_time = min(.[[1]], .[[2]], .[[3]], .[[4]], .[[5]]))
# this gets the min for EACH row
data %>% rowwise() %>% mutate(min_time = min(Sch1, Sch2, Sch3, Sch4, Sch5))
Should column notation .[[1]] return all values when in rowwise mode? I've also tried grouping on Student.ID instead of rowwise, but this doesn't make any difference.

The reason column notation .[[1]] returns all values even with grouping is that . is not actually grouped. Basically, . is the same thing as the dataset you started with. So, when you call .[[1]], you are accessing all the values in the first column.
One workaround is to add a row_number column first. This lets you index each column at the current row's number. The following should do:
data %>%
mutate(rn = row_number()) %>%
rowwise() %>%
mutate(min_time = min(.[[1]][rn], .[[2]][rn], .[[3]][rn], .[[4]][rn], .[[5]][rn])) %>%
select(-rn)
Should yield:
# Sch1 Sch2 Sch3 Sch4 Sch5 Student.ID min_time
# <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 99 292 252 859 360 Ben 99
# 2 1903 248 267 146 36 Bob 36
# 3 367 446 465 360 243 Ali 243
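Two alternatives that select the columns by position may be simpler. They are sketched here assuming dplyr 1.0 or later for c_across(); the names res and min_base are only illustrative.

```r
library(dplyr)

data <- data.frame(
  Sch1 = c(99, 1903, 367),
  Sch2 = c(292, 248, 446),
  Sch3 = c(252, 267, 465),
  Sch4 = c(859, 146, 360),
  Sch5 = c(360, 36, 243),
  Student.ID = c("Ben", "Bob", "Ali")
)

# c_across() selects columns by position and respects rowwise()
res <- data %>%
  rowwise() %>%
  mutate(min_time = min(c_across(1:5))) %>%
  ungroup()

# Base R equivalent: pmin() is vectorised across the selected columns
min_base <- do.call(pmin, data[1:5])
res
```

The pmin() version avoids row-by-row evaluation entirely, which tends to matter on larger data frames.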
