Pivot wider dataframe with difficult structure dplyr - r

I was working on something I thought would be simple, but maybe today my brain isn't working. My data is like this:
df <- tibble(metric = c('income', 'income_low', 'income_upp', 'n', 'n_low', 'n_upp'),
value = c(120, 100, 140, 10, 8, 12))
metric value
income 120
income_low 100
income_upp 140
n 10
n_low 8
n_upp 12
And I want to pivot_wider so it looks like this:
metric value value_low value_upp
income 120 100 140
n 10 8 12
I'm having trouble separating the metrics, because pivot_wider as is returns a data frame that's too wide (one column per metric):
df %>% pivot_wider(names_from = 'metric', values_from = value)
How can I achieve this, or should I pivot longer after the pivot wider?
Thanks!

I think if you convert metric into a column with "value", "value_upp" and "value_low" values, you can pivot_wider:
df %>%
mutate(param = case_when(str_detect(metric, "upp") ~ "value_upp",
str_detect(metric, "low") ~ "value_low",
TRUE ~ "value"),
metric = str_remove(metric, "_low|_upp")) %>%
pivot_wider(names_from = param, values_from = value)

I like to use separate() when I have text in a column like this. This function splits one column into multiple columns wherever a separator occurs in the column's values.
In particular in this example we would want to use the arguments sep="_" and into = c("metric", "state") to convert into columns with those names.
Then mutate() and pivot_wider() can be used as you had previously specified.
library(tidyverse)
df <- tribble(~metric, ~value,
"income", 120,
"income_low", 100,
"income_upp", 140,
"n", 10,
"n_low", 8,
"n_upp", 12)
df |>
separate(metric, sep = "_", into = c("metric", "state")) |>
mutate(state = ifelse(is.na(state), "value", state)) |>
pivot_wider(id_cols = metric, names_from = state, values_from = value, names_sep = "_")
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 4].
#> # A tibble: 2 × 4
#> metric value low upp
#> <chr> <dbl> <dbl> <dbl>
#> 1 income 120 100 140
#> 2 n 10 8 12
Created on 2022-12-21 with reprex v2.0.2
Note you can use the argument names_glue or names_prefix in pivot_wider() to add the "value" as a prefix to the column names.
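If the goal is the exact column names from the question (value, value_low, value_upp), one option is to build those names before pivoting. With names_prefix = "value_" on its own, the plain "value" column would come out as value_value. The following is only a sketch under the same assumptions as the answer above; fill = "right" is used here just to silence the separate() warning about missing pieces.

```r
library(dplyr)
library(tidyr)

df <- tribble(~metric,      ~value,
              "income",     120,
              "income_low", 100,
              "income_upp", 140,
              "n",          10,
              "n_low",      8,
              "n_upp",      12)

wide <- df |>
  # fill = "right" pads the missing second piece with NA without a warning
  separate(metric, into = c("metric", "state"), sep = "_", fill = "right") |>
  # plain rows become "value"; suffixed rows become "value_low" / "value_upp"
  mutate(state = if_else(is.na(state), "value", paste0("value_", state))) |>
  pivot_wider(names_from = state, values_from = value)

wide
```

Note that splitting on "_" only works because the metric names themselves contain no underscore; a name such as n_house would need a different separator rule.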

A data.table approach (if you can live with the trailing underscore after value_):
library(data.table)
setDT(df)
# create some new columns based on metric
df[, c("first", "second") := tstrsplit(metric, "_")]
# metric value first second
# 1: income 120 income <NA>
# 2: income_low 100 income low
# 3: income_upp 140 income upp
# 4: n 10 n <NA>
# 5: n_low 8 n low
# 6: n_upp 12 n upp
# replace NA with ""
df[is.na(df)] <- ""
# now cast to wide, creating colnames on the fly
dcast(df, first ~ paste0("value_", second), value.var = "value")
# first value_ value_low value_upp
# 1: income 120 100 140
# 2: n 10 8 12
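If the trailing underscore in value_ is a problem, a possible variation is to build the target column name before casting, so that rows without a suffix map to plain "value". This is a sketch rather than part of the original answer; the helper column name col is made up for illustration.

```r
library(data.table)

df <- data.table(metric = c("income", "income_low", "income_upp",
                            "n", "n_low", "n_upp"),
                 value  = c(120, 100, 140, 10, 8, 12))

# split metric into its base name and an optional suffix (NA when absent)
df[, c("first", "second") := tstrsplit(metric, "_", fixed = TRUE)]

# build the final column name up front so no bare "value_" is produced
# ("col" is a hypothetical helper column)
df[, col := fifelse(is.na(second), "value", paste0("value_", second))]

wide <- dcast(df, first ~ col, value.var = "value")
wide
```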

Related

How to sum values from one column based on specific conditions from other column in R?

I have a dataset that looks something like this:
df <- data.frame(plot = c("A","A","A","A","A","B","B","B","B","B","C","C","C","C","C"),
species = c("Fagus","Fagus","Quercus","Picea", "Abies","Fagus","Fagus","Quercus","Picea", "Abies","Fagus","Fagus","Quercus","Picea", "Abies"),
value = sample(100, size = 15, replace = TRUE))
head(df)
plot species value
1 A Fagus 53
2 A Fagus 48
3 A Quercus 5
4 A Picea 25
5 A Abies 12
6 B Fagus 12
Now, I want to create a new data frame containing per plot values for share.conifers and share.broadleaves by basically summing the values with conditions applied for species. I thought about using case_when but I am not sure how to write the syntax:
df1 <- df %>% share.broadleaves = case_when(plot = plot & species = "Fagus" or species = "Quercus" ~ FUN="sum")
df1 <- df %>% share.conifers = case_when(plot = plot & species = "Abies" or species = "Picea" ~ FUN="sum")
I know this is not right, but I would like something like this.
Using dplyr/tidyr:
First construct the group, do the calculation and then spread into columns.
library(dplyr)
library(tidyr)
df |>
mutate(type = case_when(species %in% c("Fagus", "Quercus") ~ "broadleaves",
species %in% c("Abies", "Picea") ~ "conifers")) |>
group_by(plot, type) |>
summarise(share = sum(value)) |>
ungroup() |>
pivot_wider(values_from = "share", names_from = "type", names_prefix = "share.")
Output:
# A tibble: 3 × 3
plot share.broadleaves share.conifers
<chr> <int> <int>
1 A 159 77
2 B 53 42
3 C 204 63
I am not sure if you want to sum or get the share, but the code could easily be adapted to whatever goal you have.
One way could just be summarizing by plot and species:
library(dplyr)
df |>
group_by(plot, species) |>
summarize(share = sum(value))
If you really want to get the share of a specific species per plot you could also do:
df |>
group_by(plot) |>
summarize(share_certain_species = sum(value[species %in% c("Fagus", "Quercus")]) / sum(value))
which gives:
# A tibble: 3 × 2
plot share_certain_species
<chr> <dbl>
1 A 0.546
2 B 0.583
3 C 0.480
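To get both columns the question asks for in one step, the same subsetting idea can be used twice inside summarise(). The sketch below assumes that "share" means a proportion of the plot total, and it uses small fixed values instead of sample() so the result is reproducible.

```r
library(dplyr)

# fixed toy data (hypothetical values) in the same layout as the question
df <- data.frame(plot    = rep(c("A", "B"), each = 4),
                 species = rep(c("Fagus", "Quercus", "Picea", "Abies"), times = 2),
                 value   = c(30, 20, 40, 10, 10, 10, 60, 20))

shares <- df |>
  group_by(plot) |>
  summarise(
    # sum of the broadleaf species divided by the plot total
    share.broadleaves = sum(value[species %in% c("Fagus", "Quercus")]) / sum(value),
    # sum of the conifer species divided by the plot total
    share.conifers    = sum(value[species %in% c("Abies", "Picea")]) / sum(value)
  )

shares
```

Dropping the division by sum(value) gives plain sums per group instead of proportions.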

Extract all row.names in a data.frame that match a value in another data.frame

I have a data.frame with 150 column names. For each column, I want to extract the maximum and minimum values (the rows repeat) and the row names of each maximum value. I have extracted the min and max values in another data.frame but don't know how to match them.
I have found functions that are very close for this, like for minimum values:
head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
sapply(cars,which.min)
speed dist
1 1
Here, it only gives the first index for minimum speed.
And I've tried with loops like:
for (i in (colnames(cars))){
print(min(cars[[i]]))
}
[1] 4
[1] 2
But that just gives me the minimum values, and not if they are repeated and the rowname of each repeated value.
I want something like:
min.value column rowname freq.times
4 speed 1,2 2
2 dist 1 1
Thanks, and sorry for any spelling mistakes; I'm not a native speaker.
One option is to use tidyverse. I was a little unclear if you want min and max in the same dataframe, so I included both. First, I create an index column with row numbers. Then, I pivot to long format to determine which values are minimum and maximum (using case_when). Then, I drop the rows that are not min or max (i.e., NA in category). Then, I use summarise to turn the row names into a single character string and get the frequency of a given minimum or maximum value.
library(tidyverse)
cars %>%
mutate(rowname = row_number()) %>%
pivot_longer(-rowname, names_to = "column", values_to = "value") %>%
group_by(column) %>%
mutate(category = case_when((value == min(value)) == TRUE ~ "min",
(value == max(value)) == TRUE ~ "max")) %>%
drop_na(category) %>%
group_by(column, value, category) %>%
summarise(rowname = toString(rowname), freq.times = n()) %>%
select(2:3, 1, 4, 5)
Output
# A tibble: 4 × 5
# Groups: column, value [4]
value category column rowname freq.times
<dbl> <chr> <chr> <chr> <int>
1 2 min dist 1 1
2 120 max dist 49 1
3 4 min speed 1, 2 2
4 25 max speed 50 1
However, if you want to produce the data frames separately, you could adapt it like this. Here, I don't use category and instead use filter to drop all rows that are not the minimum for a group/column. Then, we can summarise as we did above. You can do the same thing for max as well.
cars %>%
mutate(rowname = row_number()) %>%
pivot_longer(-rowname, names_to = "column", values_to = "min.value") %>%
group_by(column) %>%
filter(min.value == min(min.value)) %>%
group_by(column, min.value) %>%
summarise(rowname = toString(rowname), freq.times = n()) %>%
select(2, 1, 3, 4)
Output
# A tibble: 2 × 4
# Groups: column [2]
min.value column rowname freq.times
<dbl> <chr> <chr> <int>
1 2 dist 1 1
2 4 speed 1, 2 2
Here is another tidyverse approach:
which.min(.) gives only the first index, whereas which(. == min(.)) gives all indices for which the condition is true.
Analogously, to get the frequency we can use length(which(. == min(.))).
We summarise across all columns to get min.value, rowname and freq.times.
The part after that is pivoting to bring the column names into position.
library(tidyverse)
cars %>%
summarise(across(dplyr::everything(), list(min.value = min,
rowname = ~list(which(. == min(.))),
freq.times = ~length(which(.==min(.)))))) %>%
pivot_longer(
cols = contains("_"),
names_to = "key",
values_to = "val",
values_transform = list(val = as.character)
) %>%
separate(key, c("column", "name"), sep="_") %>%
pivot_wider(
names_from = name,
values_from = val
) %>%
mutate(rowname = str_replace(rowname, '\\:', '\\,'))
column min.value rowname freq.times
<chr> <chr> <chr> <chr>
1 speed 4 1,2 2
2 dist 2 1 1
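The difference between which.min() and which(x == min(x)) described in this answer can be checked directly in base R on the built-in cars data. This is a small illustration only; the variable names are made up.

```r
speed <- cars$speed  # built-in dataset, no packages needed

first_min <- which.min(speed)            # index of the first minimum only
all_min   <- which(speed == min(speed))  # indices of every row holding the minimum
n_min     <- length(all_min)             # how often the minimum occurs

first_min
all_min
n_min
```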
min.value <- sapply(cars, min)
columns <- names(min.value)
row.values <- sapply(columns, \(x) which(cars[[x]] == min.value[which(names(min.value) == x)]))
freq.times <- sapply(row.values, length)
row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
names(min.value) <- names(row.values) <- names(freq.times) <- NULL
data.frame(min.value = min.value,
columns = columns,
row.values = row.values,
freq.times = freq.times)
min.value columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
Here it is wrapped in a function, so that you can use it with whatever data frame and function you need:
create_table <- function(df, FUN) {
values <- sapply(df, FUN)
columns <- names(values)
row.values <- sapply(columns, \(x) which(df[[x]] == values[which(names(values) == x)]))
freq.times <- sapply(row.values, length)
row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
names(values) <- names(row.values) <- names(freq.times) <- NULL
data.frame(values = values,
columns = columns,
row.values = row.values,
freq.times = freq.times)
}
create_table(cars, min)
values columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
create_table(cars, max)
values columns row.values freq.times
1 25 speed 50 1
2 120 dist 49 1
You can use which to obtain the positions. sapply should work. Since you need multiple summary statistics for each column, you just have to wrap them up in a list. Something like this:
as.data.frame(sapply(cars, \(x) {
extrema <- range(x)
min.row <- which(x == extrema[[1L]])
max.row <- which(x == extrema[[2L]])
list(
min.value = extrema[[1L]], max.value = extrema[[2L]],
min.row = min.row, max.row = max.row,
freq.min = length(min.row), freq.max = length(max.row)
)
}))
Output
speed dist
min.value 4 2
max.value 25 120
min.row 1, 2 1
max.row 50 49
freq.min 2 1
freq.max 1 1

What is the best way to handle potentially missing columns when summarizing?

A financial statement is a good illustration of this issue. Here is an example dataframe:
df <- data.frame( date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 10),
category = sample(c('a','b', 'c'), 10, replace=TRUE),
direction = sample(c('credit', 'debit'), 10, replace=TRUE),
value = sample(0:25, 10, replace = TRUE) )
I want to produce a summary table with incoming, outgoing and total columns for each category.
df %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE), outgoing=sum(debit,na.rm=TRUE) ) %>%
mutate(total= incoming-outgoing)
In most cases this works perfectly with the example dataframe above.
But there are cases where df$direction could contain a single value e.g., credit, resulting in an error.
Error: Problem with `summarise()` column `outgoing`.
object 'debit' not found
Given that I have no control over the dataframe, what is the best way to handle this?
I've been playing around with a conditional statement in the summarize method to check that the column exists, but have not managed to get this working.
...
summarize( outgoing = case_when(
"debit" %in% colnames(.) ~ sum(debit,na.rm=TRUE),
TRUE ~ 0 ) )
...
Have I made a syntax error, or am I going in completely the wrong direction with this?
The issue occurs only when one of the values is absent, i.e. there is 'credit' but no 'debit', or vice versa. In that case pivot_wider does not create the missing column. Instead of pivoting and then summarising, do this directly in summarise with ==: if 'debit' is absent, value[direction == "debit"] is empty and sum returns 0.
library(dplyr)
df %>%
slice(-c(9:10)) %>% # just removed the 'debit' rows completely
group_by(category) %>%
summarise(total = sum(value[direction == 'credit']) -
sum(value[direction == "debit"]))
-output
# A tibble: 3 × 2
category total
<chr> <int>
1 a 15
2 b 30
3 c 63
With pivot_wider, that is not the case:
df %>%
slice(-c(9:10)) %>%
pivot_wider(names_from = direction, values_from = value)
# A tibble: 8 × 3
date category credit
<date> <chr> <int>
1 2020-07-25 c 19
2 2020-05-09 b 15
3 2020-08-27 a 15
4 2020-03-27 b 15
5 2020-04-06 c 6
6 2020-07-06 c 11
7 2020-09-22 c 25
8 2020-10-06 c 2
It creates only the 'credit' column, so referring to the 'debit' column, which was never created, throws an error:
df %>%
slice(-c(9:10)) %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE),
outgoing=sum(debit,na.rm=TRUE) )
Error: Problem with summarise() column outgoing.
ℹ outgoing = sum(debit, na.rm = TRUE).
✖ object 'debit' not found
ℹ The error occurred in group 1: category = "a".
Run rlang::last_error() to see where the error occurred.
In this case, we can use complete() to add rows for 'debit' as well, which will have NA in the other columns:
library(tidyr)
df %>%
slice(-c(9:10)) %>%
complete(category, direction = c("credit", "debit")) %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE),
outgoing=sum(debit,na.rm=TRUE) ) %>%
mutate(total= incoming-outgoing)
# A tibble: 3 × 4
category incoming outgoing total
<chr> <int> <int> <int>
1 a 15 0 15
2 b 30 0 30
3 c 63 0 63
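Another possible way to guarantee both columns is to declare the expected directions as factor levels and let pivot_wider() expand them. This is a sketch, not part of the answer above: it assumes a tidyr version that provides the names_expand argument (1.2.0 or later), and the toy values are made up.

```r
library(dplyr)
library(tidyr)

# toy data containing credits only (hypothetical values)
df <- data.frame(category  = c("a", "b", "b"),
                 direction = c("credit", "credit", "credit"),
                 value     = c(10, 5, 20))

summary_tbl <- df |>
  # declare both expected levels, even if one never occurs in the data
  mutate(direction = factor(direction, levels = c("credit", "debit"))) |>
  pivot_wider(id_cols = category,
              names_from = direction,
              values_from = value,
              values_fn = sum,        # add up repeated entries within a category
              values_fill = 0,        # combinations with no data become 0
              names_expand = TRUE) |> # one column per factor level, used or not
  rename(incoming = credit, outgoing = debit) |>
  mutate(total = incoming - outgoing)

summary_tbl
```

Because the levels are fixed up front, the same code works whether the incoming data contains only credits, only debits, or both.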

pivot_wider() generates new dataframe filled with NULL values and other misprinted values [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 1 year ago.
I am using pivot_wider() in an attempt to transform this dataframe.
subject_id test_name test_result test_unit
12 Spanish 100 print
12 English 99 online
13 Spanish 98 print
13 English 91 print
Into:
subject_id spanish_test english_test
12 100 99
13 98 91
I used pivot_wider with the following code:
test %>%
pivot_wider(id_cols = subject_id,
names_from = Test_Name,
values_from = Test_Unit)
And I got the individual test columns generated, however, they were filled with the units or NULL values. Here is the dataframe for reference:
subject_id <- c(12, 12, 13, 13)
test_name <- c("Spanish", "English", "Spanish", "English")
test_result <- c(100, 99, 98, 91)
test_unit <- c("print", "online", "print", "print")
df <- data.frame(subject_id, test_name, test_result, test_unit)
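You can use pivot_wider with values_from = test_result (rather than the unit column), using the column names exactly as they appear in the data frame: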
tidyr::pivot_wider(df,
id_cols = subject_id,
names_from = test_name,
values_from = test_result,
names_glue = '{test_name}_test')
# subject_id Spanish_test English_test
# <dbl> <dbl> <dbl>
#1 12 100 99
#2 13 98 91
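The output above keeps the original capitalisation (Spanish_test), while the question asks for lower-case names. Since names_glue is a glue specification, an R expression such as tolower() can go inside the braces. A minimal sketch using the question's data:

```r
library(tidyr)

df <- data.frame(subject_id  = c(12, 12, 13, 13),
                 test_name   = c("Spanish", "English", "Spanish", "English"),
                 test_result = c(100, 99, 98, 91),
                 test_unit   = c("print", "online", "print", "print"))

wide <- pivot_wider(df,
                    id_cols     = subject_id,
                    names_from  = test_name,
                    values_from = test_result,
                    # lower-case the test name before appending the suffix
                    names_glue  = "{tolower(test_name)}_test")

wide
```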
An alternative using reshape2 along with dplyr to rename the columns.
library(reshape2)
library(dplyr)
reshape2::dcast(df, subject_id ~ test_name,
value.var = "test_result") %>%
dplyr::rename_at(vars(Spanish:English), list( ~ paste0(., "_test"))) %>%
dplyr::rename_all(tolower) %>%
dplyr::select(subject_id, spanish_test, english_test)
Output
subject_id spanish_test english_test
1 12 100 99
2 13 98 91

How to transform my data frame to make rows columns?

I have a data frame with two columns, "Type" and "Stats". I want each type to have one row with all of the stats in separate columns. For example, my data frame looks something like this:
Column Type has values: A A A A B B B B
Column Stats has values:15 2 73 12 12 6 52 17
And I want it to look like:
Column Type has values: A B
Column Stat1 has values: 15 12
Column Stat2 has values: 2 6
Column Stat3 has values: 73 52
Column Stat4 has values: 12 17
Not all types have the same number of stats, some types are missing a stat value and others have extra. I tried using t(), but ran into issues. I then tried to combine all the values of Stat into one column and separate with gsub() and csplit(), but I had issues combining all the Stat values for each type into one column. Any advice?
We can use pivot_wider after creating a sequence column grouped by 'Type'
library(dplyr)
library(tidyr)
df1 %>%
group_by(Type) %>%
mutate(rn = paste0('Stats_', row_number())) %>%
ungroup %>%
pivot_wider(names_from = rn, values_from = Stats)
# A tibble: 2 x 5
# Type Stats_1 Stats_2 Stats_3 Stats_4
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 A 15 2 73 12
#2 B 12 6 52 17
Or using dcast from data.table
library(data.table)
dcast(setDT(df1), Type ~ paste0("Stats_", rowid(Type)), value.var = 'Stats')
Or as @Onyambu suggested in base R, it can be done with reshape
reshape(transform(df1, time = ave(Stats, Type,
FUN = seq_along)), direction = "wide", idvar = "Type", sep = "_")
data
df1 <- data.frame(Type = rep(c("A", "B"), each = 4),
Stats = c(15, 2, 73, 12, 12, 6, 52, 17))
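The question mentions that not all types have the same number of stats. The per-group counter handles that case: shorter groups simply get NA in the columns they cannot fill. A base R sketch with a made-up, deliberately unbalanced data frame:

```r
# hypothetical unbalanced data: A has three stats, B only two
df1 <- data.frame(Type  = c("A", "A", "A", "B", "B"),
                  Stats = c(15, 2, 73, 12, 6))

# running counter within each Type (1, 2, 3 for A; 1, 2 for B)
df1$time <- ave(df1$Stats, df1$Type, FUN = seq_along)

wide <- reshape(df1, direction = "wide",
                idvar = "Type", timevar = "time", sep = "_")

wide
```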
