Create a pivot table with multiple hierarchical column groups - r

I'm trying to create a pivot table (to later be rendered in markdown). However, I can't find a way to produce multiple pivot columns.
my data:
| ID | group | var1 | var2 |
| -: |:-----:|:------:|:------:|
| 1 | A | 1 | 2 |
| 2 | B | 3 | 4 |
| 3 | C | 5 | 6 |
| 4 | A | 7 | 8 |
| 5 | B | 9 | 10 |
| 6 | C | 11 | 12 |
required table:
| | groupA | groupB | groupC |
| ID | var1 | var2 | var1 | var2 | var1 | var2 |
| -: |:------:|:------:|:------:|:------:|:------:|:------:|
| 1 | 1 | 2 | | | | |
| 2 | | | 3 | 4 | | |
| 3 | | | | | 5 | 6 |
| 4 | 7 | 8 | | | | |
| 5 | | | 9 | 10 | | |
| 6 | | | | | 11 | 12 |
Obviously the result is not a dataframe or a tibble.
How can such a table be created?

if this is your example data df:
df <- structure(list(ID = 1:6, group = c("A", "B", "C", "A", "B", "C"
), var1 = c(1, 3, 5, 7, 9, 11), var2 = c(2, 4, 6, 8, 10, 12)), class = "data.frame", row.names = c(NA,
-6L))
... you can generate the table structure and column headers like this:
library(tidyr)
df %>%
pivot_longer(cols = starts_with('var'),
names_to = 'var_name',
values_to = 'value'
) %>%
pivot_wider(id_cols = ID,
names_from = c('group', 'var_name'),
names_sep = '\n', ## wrap line after group name
values_from = 'value'
)
Note that AFAIK having the group names span the variable columns would require some separate fiddling between the steps of reshaping your data (see above) and producing the markdown.

Building on @I_O's data transformation, you could add the group header with the kableExtra package, i.e.
library(dplyr)
library(tidyr)
library(kableExtra)
options(knitr.kable.NA = '')
df %>%
pivot_longer(cols = starts_with('var'),
names_to = 'var_name',
values_to = 'value'
) %>% pivot_wider(id_cols = ID,
names_from = c('group', 'var_name'),
names_sep = '\n', ## wrap line after group name
values_from = 'value'
) %>%
kbl(col.names = c("ID", "var1", "var2","var1", "var2","var1", "var2")) %>%
add_header_above(c(" ", "groupA" = 2,"groupB" = 2,"groupC" = 2 )) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
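The hard-coded col.names and add_header_above() vectors above could also be derived from the reshaped column names. This is a minimal base R sketch, assuming the wide column names follow the "group\nvar" pattern produced by names_sep = '\n'; wide_names, spans and header are illustrative names, not part of the answers above:

```r
# Column names as pivot_wider(names_sep = "\n") would produce them (assumed)
wide_names <- c("ID", "A\nvar1", "A\nvar2", "B\nvar1", "B\nvar2", "C\nvar1", "C\nvar2")

# Split every non-ID name into its group part and its variable part
parts  <- strsplit(wide_names[-1], "\n", fixed = TRUE)
groups <- vapply(parts, `[`, character(1), 1)
vars   <- vapply(parts, `[`, character(1), 2)

# Run-length encode the groups to get one header cell per group and its column span
spans  <- rle(groups)
header <- c(" " = 1, setNames(spans$lengths, paste0("group", spans$values)))
header
```

Here header has the shape add_header_above() expects (label = number of columns spanned), and c("ID", vars) could be passed as col.names to kbl().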

Using reshape2
library(reshape2)
dcast(
melt(
df,
id.vars=c("ID","group"),
measure.vars=c("var1","var2")
),
ID~group+variable,
value.var="value"
)
ID A_var1 A_var2 B_var1 B_var2 C_var1 C_var2
1 1 1 2 NA NA NA NA
2 2 NA NA 3 4 NA NA
3 3 NA NA NA NA 5 6
4 4 7 8 NA NA NA NA
5 5 NA NA 9 10 NA NA
6 6 NA NA NA NA 11 12

There are two separate issues here. One is how to print a hierarchical table in R. There are a few ways to do this, most of which produce LaTeX or HTML tables. To print a hierarchical table in the R console, one option is tabular() from the tables package:
library(tables)
library(dplyr)
fm <- function(x) if(length(x) == 0) "" else x
tabular( (ID) ~ group*(var1 + var2)*(`---`=fm),
data=mutate(df, ID = factor(ID), group = factor(group)))
#>
#> group
#> A B C
#> var1 var2 var1 var2 var1 var2
#> ID --- --- --- --- --- ---
#> 1 1 2
#> 2 3 4
#> 3 5 6
#> 4 7 8
#> 5 9 10
#> 6 11 12
The second, perhaps more important issue is how to store and work with hierarchical tabular structures. This is possible with nested tibbles. In your case, we can do something like:
library(tidyr)
nested_df <- complete(df, ID, group) %>%
nest_by(ID, group) %>%
pivot_wider(names_from = group, values_from = data)
nested_df
#> # A tibble: 6 x 4
#> ID A B C
#> <int> <list<tibble[,2]>> <list<tibble[,2]>> <list<tibble[,2]>>
#> 1 1 [1 x 2] [1 x 2] [1 x 2]
#> 2 2 [1 x 2] [1 x 2] [1 x 2]
#> 3 3 [1 x 2] [1 x 2] [1 x 2]
#> 4 4 [1 x 2] [1 x 2] [1 x 2]
#> 5 5 [1 x 2] [1 x 2] [1 x 2]
#> 6 6 [1 x 2] [1 x 2] [1 x 2]
To access, say, the var1 and var2 columns for group A we would do:
nested_df %>% select(A) %>% unnest(A)
# A tibble: 6 x 2
var1 var2
<dbl> <dbl>
1 1 2
2 NA NA
3 NA NA
4 7 8
5 NA NA
6 NA NA
Created on 2022-05-25 by the reprex package (v2.0.1)

Related

Merge rows with different values into a single row in R [duplicate]

I have a dataset that looks like this:
ID | age | disease
smith192 | 17 | lung_cancer
green484 | 12 | diabetes
green484 | 13 | heart_irregularities
tom584 | 12 | colon_cancer
tom584 | 14 | diabetes
tom584 | 15 | malnutrition
And I would like R to organize it into this:
ID | age_1 | disease_1 | age_2 | disease_2 | age_3 | disease_3 |
smith192 | 17 | lung_cancer | NA | NA | NA | NA |
green484 | 12 | diabetes | 13 | heart_irregularities | NA | NA |
tom584 | 12 | colon_cancer | 14 | diabetes | 15 | malnutrition |
Any help would be greatly appreciated!
You could create disease indices for each ID and then pivot the data to wide.
base
df |>
transform(n = ave(ID, ID, FUN = seq)) |>
reshape(direction = "wide", idvar = "ID", timevar = "n", v.names = c("age", "disease"))
# ID age.1 disease.1 age.2 disease.2 age.3 disease.3
# 1 smith192 17 lung_cancer NA <NA> NA <NA>
# 2 green484 12 diabetes 13 heart_irregularities NA <NA>
# 4 tom584 12 colon_cancer 14 diabetes 15 malnutrition
tidyverse
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
mutate(n = 1:n()) %>%
ungroup() %>%
pivot_wider(id_cols = ID, names_from = n, values_from = c(age, disease))
# # A tibble: 3 × 7
# ID age_1 age_2 age_3 disease_1 disease_2 disease_3
# <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
# 1 smith192 17 NA NA lung_cancer NA NA
# 2 green484 12 13 NA diabetes heart_irregularities NA
# 3 tom584 12 14 15 colon_cancer diabetes malnutrition
Data
df <- structure(list(ID = c("smith192", "green484", "green484", "tom584",
"tom584", "tom584"), age = c(17, 12, 13, 12, 14, 15), disease = c("lung_cancer",
"diabetes", "heart_irregularities", "colon_cancer", "diabetes",
"malnutrition")), class = "data.frame", row.names = c(NA, -6L))
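Both answers above rely on a running index within each ID. As a minimal base R sketch of just that step (the variable names are illustrative), ave() can number the rows per ID:

```r
# IDs from the example data
ID <- c("smith192", "green484", "green484", "tom584", "tom584", "tom584")

# Number the rows within each ID: 1 for the first record of an ID, 2 for the second, ...
n <- ave(seq_along(ID), ID, FUN = seq_along)
n

# The largest index determines how many age_*/disease_* column pairs the wide table needs
max(n)
```

Passing seq_along(ID) rather than ID itself as the first argument keeps the index an integer instead of a character vector.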

R: reshape from "wide" to "long", keeping some variables "wide"

I have a data file in wide format with a set of recurring variables (var1 and var2 below).
data have:
| ID | background vars| var1.A | var2.A | var1.B | var2.B | var1.C | var2.C |
| -: | :------------- |:------:|:------:|:------:|:------:|:------:|:------:|
| 1 | data1 | 1 | 2 | 3 | 4 | 5 | 6 |
| 2 | data2 | 7 | 8 | 9 | 10 | 11 | 12 |
I need to reshape it "half way" into long format, i.e. keep each var group together (wide) and put each recurrence on a separate line (long).
data want:
| ID | background vars | recurrence | var1 | var2 |
| -: | :-------------- |:----------:|:------:|:------:|
| 1 | data1 | A | 1 | 2 |
| 1 | data1 | B | 3 | 4 |
| 1 | data1 | C | 5 | 6 |
| 2 | data2 | A | 7 | 8 |
| 2 | data2 | B | 9 | 10 |
| 2 | data2 | C | 11 | 12 |
I found some solutions for this using reshape(), gather() and melt().
However, all of these collapse ALL variables to long format and do not allow some variables to be kept "wide".
How can data be shaped this way using R?
Use the keyword '.value' in the names_to argument to keep that part of the column name in wide format:
tidyr::pivot_longer(df, c(-ID, -`background vars`),
names_sep = '\\.',
names_to = c('.value', 'recurrence'))
#> # A tibble: 6 x 5
#> ID `background vars` recurrence var1 var2
#> <int> <chr> <chr> <int> <int>
#> 1 1 data1 A 1 2
#> 2 1 data1 B 3 4
#> 3 1 data1 C 5 6
#> 4 2 data2 A 7 8
#> 5 2 data2 B 9 10
#> 6 2 data2 C 11 12
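To make the role of '.value' more concrete, here is a minimal base R sketch of the name splitting that names_sep and names_to describe. It only illustrates the idea and is not how tidyr is implemented; cols and parts are illustrative names:

```r
# The wide column names that get pivoted
cols <- c("var1.A", "var2.A", "var1.B", "var2.B", "var1.C", "var2.C")

# Split each name at the literal "." into two pieces, one per names_to entry
parts <- do.call(rbind, strsplit(cols, ".", fixed = TRUE))
colnames(parts) <- c(".value", "recurrence")
parts

# Pieces labelled ".value" stay as column names in the result ...
value_cols <- unique(parts[, ".value"])
# ... while the other piece becomes the values of the new 'recurrence' column
recurrences <- unique(parts[, "recurrence"])
```

So the first piece of each name ("var1", "var2") names an output column, and the second piece ("A", "B", "C") identifies the row within each ID.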
If you need your code to be easily readable and you feel that ".value" in @Allan's example is a little opaque, you might consider a two-step pivot: simply pivot_longer() and then immediately pivot_wider() with different parameters:
df <- structure(
list(
ID = 1:2,
background.vars = c("data1", "data2"),
var1.A = c(1L, 7L),
var2.A = c(2L, 8L),
var1.B = c(3L, 9L),
var2.B = c(4L, 10L),
var1.C = c(5L, 11L),
var2.C = c(6L, 12L)),
class = "data.frame",
row.names = c(NA, -2L)
)
require(tidyr)
#> Loading required package: tidyr
long.df <-
pivot_longer(df,
c(-ID, -`background.vars`), #lengthen all columns but these
names_sep = "\\.", #split column names wherever there is a '.'
names_to = c("var", "letter"))
long.df
#> # A tibble: 12 × 5
#> ID background.vars var letter value
#> <int> <chr> <chr> <chr> <int>
#> 1 1 data1 var1 A 1
#> 2 1 data1 var2 A 2
#> 3 1 data1 var1 B 3
#> 4 1 data1 var2 B 4
#> 5 1 data1 var1 C 5
#> 6 1 data1 var2 C 6
#> 7 2 data2 var1 A 7
#> 8 2 data2 var2 A 8
#> 9 2 data2 var1 B 9
#> 10 2 data2 var2 B 10
#> 11 2 data2 var1 C 11
#> 12 2 data2 var2 C 12
pivot_wider(long.df, names_from = "var")
#> # A tibble: 6 × 5
#> ID background.vars letter var1 var2
#> <int> <chr> <chr> <int> <int>
#> 1 1 data1 A 1 2
#> 2 1 data1 B 3 4
#> 3 1 data1 C 5 6
#> 4 2 data2 A 7 8
#> 5 2 data2 B 9 10
#> 6 2 data2 C 11 12
Created on 2022-05-24 by the reprex package (v2.0.1)

How can I organize the code in a long format depending on which time of measurement

I have a question about converting a dataframe from a wide format into a long format. I haven't found any solutions that fit with my dataframe. We had three measurement timeslots with the same questionnaires (e.g. PANAS and two more questionnaires). My dataframe looks like this right now:
| code| PANAS_1| PANAS_2| PANAS1_1| PANAS1_2| PANAS2_1| PANAS2_2|
|CAPQ | 4 | 3 | 1 | 5 | 2 | 4 |
|BANI | 2 | 3 | 4 | 4 | 3 | 2 |
I want to put it into a format that looks like this:
| code| timeslot| PANAS_1| PANAS_2 |
|CAPQ | 1 | 4 | 3 |
|CAPQ | 2 | 1 | 5 |
|CAPQ | 3 | 2 | 4 |
|BANI | 1 | 2 | 3 |
|BANI | 2 | 4 | 4 |
|BANI | 3 | 3 | 2 |
I tried melt(), but I don't know how to proceed because the questionnaire variables are named inconsistently: the first timeslot uses plain names ("PANAS_1"), the second adds a 1 after the questionnaire name ("PANAS1_1"), and the third adds a 2 ("PANAS2_1"). On top of that, I have no variable that records which timeslot each item belongs to.
I hope you can understand my problem and help me solve this. If you need further information, just let me know.
Here is an approach using data.table. With melt.data.table() you can melt groups of measure.vars into separate value columns. In this case you can use patterns() to find the groups by their suffix.
library(data.table)
df <- read.table(text = "code| PANAS_1| PANAS_2| PANAS1_1| PANAS1_2| PANAS2_1| PANAS2_2
CAPQ | 4 | 3 | 1 | 5 | 2 | 4
BANI | 2 | 3 | 4 | 4 | 3 | 2
", sep = "|", header = TRUE)
setDT(df)
DT.long <- melt(df,
id.vars = "code",
measure.vars = patterns("_1", "_2"),
variable.name = "timeslot",
value.name = c("PANAS_1", "PANAS_2")
)[order(code), ]
DT.long
#> code timeslot PANAS_1 PANAS_2
#> 1: BANI 1 2 3
#> 2: BANI 2 4 4
#> 3: BANI 3 3 2
#> 4: CAPQ 1 4 3
#> 5: CAPQ 2 1 5
#> 6: CAPQ 3 2 4
Created on 2021-08-19 by the reprex package (v2.0.1)
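To see which columns each pattern picks up, here is a minimal base R sketch using grep(). It is only an illustration of the grouping, not of data.table internals, and it assumes the columns appear in timeslot order as in the question:

```r
# Column names from the question, in their original order
cols <- c("PANAS_1", "PANAS_2", "PANAS1_1", "PANAS1_2", "PANAS2_1", "PANAS2_2")

# Each pattern collects one item across all timeslots
item1 <- grep("_1", cols, value = TRUE)
item2 <- grep("_2", cols, value = TRUE)

# The position within each group is what ends up as the timeslot number
data.frame(timeslot = seq_along(item1), PANAS_1 = item1, PANAS_2 = item2)
```

Because the timeslot is inferred from position only, reordering the columns would change the result.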
Here is one approach using tidyverse. You can use pivot_longer to put into long format, and separate out the last number after the underscore. Then, you can add a timeslot variable for each code/number combination, assuming the times are in order. Finally, you can revert to wide format with pivot_wider (or leave as is for further processing/analysis).
library(tidyverse)
df %>%
pivot_longer(cols = -code, names_to = c("var", "PANAS"), names_sep = "_") %>%
group_by(code, PANAS) %>%
mutate(timeslot = 1:n()) %>%
pivot_wider(id_cols = c(code, timeslot), names_from = PANAS, names_prefix = "PANAS_", values_from = value)
Output
code timeslot PANAS_1 PANAS_2
<chr> <int> <dbl> <dbl>
1 CAPQ 1 4 3
2 CAPQ 2 1 5
3 CAPQ 3 2 4
4 BANI 1 2 3
5 BANI 2 4 4
6 BANI 3 3 2
Alternatively, you can rename your column names and include the time inside them explicitly:
names(df) <- c("code", paste("PANAS", rep(1:3, each = 2), rep(1:2, times = 3), sep = "_"))
df %>%
pivot_longer(cols = -code, names_to = c("timeslot", "PANAS"), names_pattern = "PANAS_(\\d+)_(\\d+)") %>%
pivot_wider(id_cols = c(code, timeslot), names_from = PANAS, names_prefix = "PANAS_", values_from = value)
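As a quick check of the renaming step, this base R sketch shows the names that the paste() call generates and how the two capture groups of the regular expression split them; new_names, timeslot and item are illustrative names:

```r
# The names generated by the paste() call above
new_names <- paste("PANAS", rep(1:3, each = 2), rep(1:2, times = 3), sep = "_")
new_names

# First capture group: the timeslot; second capture group: the item number
timeslot <- sub("PANAS_(\\d+)_(\\d+)", "\\1", new_names)
item     <- sub("PANAS_(\\d+)_(\\d+)", "\\2", new_names)
data.frame(new_names, timeslot, item)
```

This renaming assumes the original columns are ordered by timeslot first and item second.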

How to assign a number between 1 and n in R to rows?

I would like to randomly assign each individual in my data to a group numbered 1 through 3. How would I do this? (A dplyr solution is preferred.) Rows with the same id # belong to the same individual and must end up in the same group.
_______________________
id # | group_id |
454452 | 1 |
5450441 | 2 |
5444531 | 3 |
5444531 | 3 |
5404501 | 1 |
5404041 | 2 |
5404041 | 2 |
254252 | 3 |
541254 | 2 |
_______________________
A simple solution might be:
df <- df %>% group_by(id) %>% mutate(group_id = sample(1:3,1))
which (using set.seed(12345)) resulted in:
id group_id
1 454452 3
2 5450441 1
3 5444531 2
4 5444531 2
5 5404501 2
6 5404041 3
7 5404041 3
8 254252 2
9 541254 2
Here's one option:
library(dplyr)
df <-
tibble(ids = c(100, 200, 200, 300, 300, 400))
distinct_ids <-
df %>%
select(ids) %>%
distinct() %>%
mutate(group_num = sample.int(3, size = nrow(.), replace = TRUE))
df %>%
left_join(distinct_ids, by = "ids")
# A tibble: 6 x 2
ids group_num
<dbl> <int>
1 100 3
2 200 1
3 200 1
4 300 3
5 300 3
6 400 2
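The same idea (draw one group per distinct id, then map it back to every row) can be sketched in base R with match() instead of a join. This is a minimal sketch using the ids from the question; uniq_ids and grp are illustrative names:

```r
set.seed(1)  # only for reproducibility

# ids from the question; repeated ids belong to the same individual
id <- c(454452, 5450441, 5444531, 5444531, 5404501, 5404041, 5404041, 254252, 541254)

# Draw one group number per distinct id ...
uniq_ids <- unique(id)
grp <- sample(1:3, length(uniq_ids), replace = TRUE)

# ... and look it up for every row, so repeated ids share a group
group_id <- grp[match(id, uniq_ids)]
data.frame(id, group_id)
```

If roughly equal group sizes matter, one option would be to draw grp as sample(rep_len(1:3, length(uniq_ids))) instead.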
In base R we could sample the factorized "id" and display them as.numeric.
set.seed(42) # for sake of reproducibility
dat <- transform(dat, group_id=as.numeric(factor(id, levels=sample(unique(dat$id)))))
dat
# id X1 X2 X3 group_id
# 1 454452 -1.1045994 0.0356312 1.93557177 1
# 2 5450441 0.5390238 1.3149588 1.72323080 5
# 3 5444531 0.5802063 0.9781675 0.35840206 6
# 4 5444531 -0.6575028 0.8817912 0.30243092 6
# 5 5404501 1.5548955 0.4822047 -0.39411451 7
# 6 5404041 -1.1876414 0.9657529 0.78814062 2
# 7 5404041 0.1518129 -0.8145709 0.67070383 2
# 8 254252 -1.0861326 0.2839578 -0.94918081 4
# 9 541254 1.6133728 -0.1616986 0.03613574 3
Data
dat <- structure(list(id = c(454452L, 5450441L, 5444531L, 5444531L,
5404501L, 5404041L, 5404041L, 254252L, 541254L), X1 = c(-1.10459944068306,
0.539023801893912, 0.580206320853481, -0.657502835154674, 1.55489554810057,
-1.18764140164182, 0.151812914504533, -1.08613257605253, 1.61337280035418
), X2 = c(0.0356311982051355, 1.31495884897891, 0.978167526364279,
0.881791226863203, 0.482204688262918, 0.965752878105794, -0.814570938270238,
0.283957806364306, -0.161698647607024), X3 = c(1.93557176599585,
1.72323079854894, 0.358402056802064, 0.3024309248682, -0.394114506412192,
0.788140622823556, 0.67070382675052, -0.949180809687611, 0.0361357384849679
)), class = "data.frame", row.names = c(NA, -9L))

shift cell value from one time stamp to other in R

Is it possible to move the value of a single cell in a column from one timestamp to another in time series data without losing any other data? I have tried the shift and slide functions, but they replace the data with NA values.
I have also tried mutate, but it changes the whole column. Is there a function or method for this kind of manipulation?
E.g, convert :
Date_Time | x | y
01-01-2016 | 1 | 2
02-01-2016 | 3 | 4
03-01-2016 | 5 | 6
04-01-2016 | 2 | 5
to:
Date_Time | x | y
01-01-2016 | 5 | 2
02-01-2016 | 3 | 4
03-01-2016 | 1 | 6
04-01-2016 | 2 | 5
or slide the data vertically
Date_Time | x | y
01-01-2016 | 2 | 2
02-01-2016 | 1 | 4
03-01-2016 | 3 | 6
04-01-2016 | 5 | 5
To swap two values, you need to hold one in a temporary variable. We can write a simple function:
swap = function(x, i, j) {
stopifnot(length(i) == length(j))
temp = x[i]
x[i] = x[j]
x[j] = temp
return(x)
}
On your data, it should work like this to give the desired result:
your_data$x = swap(your_data$x, which.min(your_data$x), which.max(your_data$x))
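For a self-contained check, here is the same function applied to the x column from the question. The function is repeated so that the snippet runs on its own:

```r
# Swap the elements at positions i and j of x, holding one in a temporary variable
swap <- function(x, i, j) {
  stopifnot(length(i) == length(j))
  temp <- x[i]
  x[i] <- x[j]
  x[j] <- temp
  x
}

x <- c(1, 3, 5, 2)  # the x column from the question
swapped <- swap(x, which.min(x), which.max(x))
swapped
```

Note that which.min() and which.max() return only the first position of the minimum or maximum, so with ties only one pair of cells is exchanged.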
Two other options with dplyr:
library(dplyr)
df %>%
mutate(x = case_when(
x == max(x) ~ min(x),
x == min(x) ~ max(x),
TRUE ~ x
))
df %>%
mutate(x = replace(x, c(which.max(x), which.min(x)), c(min(x), max(x))))
Result:
Date_Time x y
1 01-01-2016 5 2
2 02-01-2016 3 4
3 03-01-2016 1 6
4 04-01-2016 2 5
To shift x vertically:
df %>%
mutate(x = c(x[-1], x[1]))
or
df %>%
mutate(x = c(x[length(x)], x[-length(x)]))
Result:
> df %>%
+ mutate(x = c(x[-1], x[1]))
Date_Time x y
1 01-01-2016 3 2
2 02-01-2016 5 4
3 03-01-2016 2 6
4 04-01-2016 1 5
> df %>%
+ mutate(x = c(x[length(x)], x[-length(x)]))
Date_Time x y
1 01-01-2016 2 2
2 02-01-2016 1 4
3 03-01-2016 3 6
4 04-01-2016 5 5
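The two one-step shifts above are special cases of a circular rotation. As a minimal sketch, a hypothetical helper rotate() (not part of dplyr or base R) could shift by any number of positions:

```r
# Rotate x circularly by k positions: positive k moves values down (towards later rows),
# negative k moves them up. Values that fall off one end re-enter at the other.
rotate <- function(x, k = 1) {
  n <- length(x)
  if (n == 0) return(x)
  k <- k %% n              # map any k, including negative ones, into 0..n-1
  if (k == 0) return(x)
  c(x[(n - k + 1):n], x[seq_len(n - k)])
}

x <- c(1, 3, 5, 2)
down <- rotate(x, 1)    # same as c(x[length(x)], x[-length(x)])
up   <- rotate(x, -1)   # same as c(x[-1], x[1])
```

Inside a pipeline this could be used as mutate(x = rotate(x, 1)).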
Data:
df = read.table(text = "Date_Time | x | y
01-01-2016 | 1 | 2
02-01-2016 | 3 | 4
03-01-2016 | 5 | 6
04-01-2016 | 2 | 5", header = TRUE, sep = "|")
