R: reshape from "wide" to "long", keeping some variables "wide" - r

I have a data file in wide format, with a set of recurring variables (var1 and var2, below).
data have:
| ID | background vars| var1.A | var2.A | var1.B | var2.B | var1.C | var2.C |
| -: | :------------- |:------:|:------:|:------:|:------:|:------:|:------:|
| 1 | data1 | 1 | 2 | 3 | 4 | 5 | 6 |
| 2 | data2 | 7 | 8 | 9 | 10 | 11 | 12 |
I need to reshape it "halfway" into long format, i.e. keep each var group together (wide) and put each recurrence on a separate line (long).
data want:
| ID | background vars | recurrence | var1 | var2 |
| -: | :-------------- |:----------:|:------:|:------:|
| 1 | data1 | A | 1 | 2 |
| 1 | data1 | B | 3 | 4 |
| 1 | data1 | C | 5 | 6 |
| 2 | data2 | A | 7 | 8 |
| 2 | data2 | B | 9 | 10 |
| 2 | data2 | C | 11 | 12 |
I found some solutions for this using reshape(), gather() and melt().
However, all of these collapse ALL variables to long format and do not allow some variables to be kept "wide".
How can data be shaped this way using R?

Use the keyword '.value' in the names_to argument to keep that part of the column name in wide format:
tidyr::pivot_longer(df, c(-ID, -`background vars`),
names_sep = '\\.',
names_to = c('.value', 'recurrence'))
#> # A tibble: 6 x 5
#> ID `background vars` recurrence var1 var2
#> <int> <chr> <chr> <int> <int>
#> 1 1 data1 A 1 2
#> 2 1 data1 B 3 4
#> 3 1 data1 C 5 6
#> 4 2 data2 A 7 8
#> 5 2 data2 B 9 10
#> 6 2 data2 C 11 12
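For comparison, base R's reshape() can also produce this "half-long" layout when the column names split cleanly on a separator. The following is only a sketch under assumptions that are not part of the answer above: the background column is given the syntactic name background.vars, and sep = "." is relied on so that reshape() guesses the variable names (var1, var2) and the recurrence labels (A, B, C) from the column names.

```r
# Example data mirroring the question (background column renamed to be syntactic)
df <- data.frame(
  ID = 1:2,
  background.vars = c("data1", "data2"),
  var1.A = c(1L, 7L), var2.A = c(2L, 8L),
  var1.B = c(3L, 9L), var2.B = c(4L, 10L),
  var1.C = c(5L, 11L), var2.C = c(6L, 12L)
)

long <- reshape(
  df,
  direction = "long",
  varying   = names(df)[3:8],  # the recurring columns
  sep       = ".",             # split "var1.A" into "var1" and "A"
  timevar   = "recurrence",    # column that receives A / B / C
  idvar     = "ID"
)

# reshape() stacks by recurrence; reorder to one block per ID
long <- long[order(long$ID, long$recurrence), ]
rownames(long) <- NULL
long
```

The design difference is mostly one of taste: reshape() infers the split from sep, while pivot_longer() states it explicitly through names_to.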

If you need your code to be easily readable and you feel that ".value" in @Allan's example is a little opaque, you might consider a two-step pivot: simply pivot_longer() and then immediately pivot_wider() with different parameters:
df <- structure(
list(
ID = 1:2,
background.vars = c("data1", "data2"),
var1.A = c(1L, 7L),
var2.A = c(2L, 8L),
var1.B = c(3L, 9L),
var2.B = c(4L, 10L),
var1.C = c(5L, 11L),
var2.C = c(6L, 12L)),
class = "data.frame",
row.names = c(NA, -2L)
)
require(tidyr)
#> Loading required package: tidyr
long.df <-
pivot_longer(df,
c(-ID, -`background.vars`), #lengthen all columns but these
names_sep = "\\.", #split column names wherever there is a '.'
names_to = c("var", "letter"))
long.df
#> # A tibble: 12 × 5
#> ID background.vars var letter value
#> <int> <chr> <chr> <chr> <int>
#> 1 1 data1 var1 A 1
#> 2 1 data1 var2 A 2
#> 3 1 data1 var1 B 3
#> 4 1 data1 var2 B 4
#> 5 1 data1 var1 C 5
#> 6 1 data1 var2 C 6
#> 7 2 data2 var1 A 7
#> 8 2 data2 var2 A 8
#> 9 2 data2 var1 B 9
#> 10 2 data2 var2 B 10
#> 11 2 data2 var1 C 11
#> 12 2 data2 var2 C 12
pivot_wider(long.df, names_from = "var")
#> # A tibble: 6 × 5
#> ID background.vars letter var1 var2
#> <int> <chr> <chr> <int> <int>
#> 1 1 data1 A 1 2
#> 2 1 data1 B 3 4
#> 3 1 data1 C 5 6
#> 4 2 data2 A 7 8
#> 5 2 data2 B 9 10
#> 6 2 data2 C 11 12
Created on 2022-05-24 by the reprex package (v2.0.1)
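To see that the one-step ".value" form and the two-step form describe the same reshaping, the two results can be compared directly. This is a minimal sketch that assumes the df with the background.vars column from this answer; the object names one_step and two_step are made up for illustration.

```r
library(tidyr)

df <- data.frame(
  ID = 1:2,
  background.vars = c("data1", "data2"),
  var1.A = c(1L, 7L), var2.A = c(2L, 8L),
  var1.B = c(3L, 9L), var2.B = c(4L, 10L),
  var1.C = c(5L, 11L), var2.C = c(6L, 12L)
)

# One step: ".value" keeps the first piece of each name as a column name
one_step <- pivot_longer(df, -c(ID, background.vars),
                         names_sep = "\\.",
                         names_to = c(".value", "letter"))

# Two steps: go fully long, then spread the variable names back out
two_step <- pivot_longer(df, -c(ID, background.vars),
                         names_sep = "\\.",
                         names_to = c("var", "letter"))
two_step <- pivot_wider(two_step, names_from = "var", values_from = "value")

# Compare as plain data frames
all.equal(as.data.frame(one_step), as.data.frame(two_step))
```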

Related

Merge rows with different values into a single row in R [duplicate]

I have a dataset that looks like this:
ID | age | disease
smith192 | 17 | lung_cancer
green484 | 12 | diabetes
green484 | 13 | heart_irregularities
tom584 | 12 | colon_cancer
tom584 | 14 | diabetes
tom584 | 15 | malnutrition
And I would like R to organize it into this:
ID | age_1 | disease_1 | age_2 | disease_2 | age_3 | disease_3 |
smith192 | 17 | lung_cancer | NA | NA | NA | NA |
green484 | 12 | diabetes | 13 | heart_irregularities | NA | NA |
tom584 | 12 | colon_cancer | 14 | diabetes | 15 | malnutrition |
Any help would be greatly appreciated!
You could create disease indices for each ID and then pivot the data to wide.
base
df |>
transform(n = ave(ID, ID, FUN = seq)) |>
reshape(direction = "wide", idvar = "ID", timevar = "n", v.names = c("age", "disease"))
# ID age.1 disease.1 age.2 disease.2 age.3 disease.3
# 1 smith192 17 lung_cancer NA <NA> NA <NA>
# 2 green484 12 diabetes 13 heart_irregularities NA <NA>
# 4 tom584 12 colon_cancer 14 diabetes 15 malnutrition
tidyverse
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
mutate(n = 1:n()) %>%
ungroup() %>%
pivot_wider(id_cols = ID, names_from = n, values_from = c(age, disease))
# # A tibble: 3 × 7
# ID age_1 age_2 age_3 disease_1 disease_2 disease_3
# <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
# 1 smith192 17 NA NA lung_cancer NA NA
# 2 green484 12 13 NA diabetes heart_irregularities NA
# 3 tom584 12 14 15 colon_cancer diabetes malnutrition
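The tidyverse output above groups all age columns before all disease columns, whereas the question asks for age_1, disease_1, age_2, and so on. A possible way to get that interleaved order is the names_vary argument of pivot_wider(); the sketch below assumes tidyr 1.2.0 or later (where names_vary is available) and uses row_number() in place of 1:n().

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c("smith192", "green484", "green484", "tom584", "tom584", "tom584"),
  age = c(17, 12, 13, 12, 14, 15),
  disease = c("lung_cancer", "diabetes", "heart_irregularities",
              "colon_cancer", "diabetes", "malnutrition")
)

wide <- df %>%
  group_by(ID) %>%
  mutate(n = row_number()) %>%   # per-ID index: 1, 2, 3, ...
  ungroup() %>%
  pivot_wider(id_cols = ID,
              names_from = n,
              values_from = c(age, disease),
              names_vary = "slowest")  # keep age_k and disease_k adjacent

names(wide)
```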
Data
df <- structure(list(ID = c("smith192", "green484", "green484", "tom584",
"tom584", "tom584"), age = c(17, 12, 13, 12, 14, 15), disease = c("lung_cancer",
"diabetes", "heart_irregularities", "colon_cancer", "diabetes",
"malnutrition")), class = "data.frame", row.names = c(NA, -6L))
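Both answers start by numbering the rows within each ID, and that index is what becomes the _1, _2, _3 suffix. As a small illustration of just that step in base R (a sketch; the column name n follows the answers, and seq_along() is used here as an explicit counter):

```r
df <- data.frame(
  ID = c("smith192", "green484", "green484", "tom584", "tom584", "tom584"),
  age = c(17, 12, 13, 12, 14, 15),
  disease = c("lung_cancer", "diabetes", "heart_irregularities",
              "colon_cancer", "diabetes", "malnutrition")
)

# Count rows within each ID, in the order they appear
df$n <- ave(seq_along(df$ID), df$ID, FUN = seq_along)
df[, c("ID", "n")]
```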

Create a pivot table with multiple hierarchical column groups

I'm trying to create a pivot table (to later be rendered in markdown). However, I can't find a way to produce multiple pivot columns.
my data:
| ID | group | var1 | var2 |
| -: |:-----:|:------:|:------:|
| 1 | A | 1 | 2 |
| 2 | B | 3 | 4 |
| 3 | C | 5 | 6 |
| 4 | A | 7 | 8 |
| 5 | B | 9 | 10 |
| 6 | C | 11 | 12 |
required table:
| | groupA | groupB | groupC |
| ID | var1 | var2 | var1 | var2 | var1 | var2 |
| -: |:------:|:------:|:------:|:------:|:------:|:------:|
| 1 | 1 | 2 | | | | |
| 2 | | | 3 | 4 | | |
| 3 | | | | | 5 | 6 |
| 4 | 7 | 8 | | | | |
| 5 | | | 9 | 10 | | |
| 6 | | | | | 11 | 12 |
Obviously the result is not a dataframe or a tibble.
How can such a table be created?
If this is your example data df:
df <- structure(list(ID = 1:6, group = c("A", "B", "C", "A", "B", "C"
), var1 = c(1, 3, 5, 7, 9, 11), var2 = c(2, 4, 6, 8, 10, 12)), class = "data.frame", row.names = c(NA,
-6L))
... you can generate the table structure and column headers like this:
library(tidyr)
df %>%
pivot_longer(cols = starts_with('var'),
names_to = 'var_name',
values_to = 'value'
) %>%
pivot_wider(id_cols = ID,
names_from = c('group', 'var_name'),
names_sep = '\n', ## wrap line after group name
values_from = 'value'
)
Note that AFAIK having the group names span the variable columns would require some separate fiddling between the steps of reshaping your data (see above) and producing the markdown.
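The same column layout can probably also be reached with a single pivot_wider() call, skipping the intermediate long step. This is a sketch rather than the answer's method; it assumes tidyr 1.2.0 or later for names_vary, and uses an underscore separator via names_glue instead of the line break used above.

```r
library(tidyr)

df <- data.frame(
  ID = 1:6,
  group = c("A", "B", "C", "A", "B", "C"),
  var1 = c(1, 3, 5, 7, 9, 11),
  var2 = c(2, 4, 6, 8, 10, 12)
)

wide <- pivot_wider(
  df,
  id_cols     = ID,
  names_from  = group,
  values_from = c(var1, var2),
  names_glue  = "{group}_{.value}",  # group first, then variable name
  names_vary  = "slowest"            # keep each group's columns together
)

names(wide)
```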
Building on @I_O's data transformation, you could add the group headers with the kableExtra package, i.e.
library(dplyr)
library(tidyr)
library(kableExtra)
options(knitr.kable.NA = '')
df %>%
pivot_longer(cols = starts_with('var'),
names_to = 'var_name',
values_to = 'value'
) %>% pivot_wider(id_cols = ID,
names_from = c('group', 'var_name'),
names_sep = '\n', ## wrap line after group name
values_from = 'value'
) %>%
kbl(col.names = c("ID", "var1", "var2","var1", "var2","var1", "var2")) %>%
add_header_above(c(" ", "groupA" = 2,"groupB" = 2,"groupC" = 2 )) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
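The column labels and header spans are hard-coded in the call above. If the number of groups can change, they could be derived from the data instead. The sketch below only builds the two vectors in base R; the names vars, col_labels and header are illustrative, and the assumption is that every group contributes the same variables.

```r
df <- data.frame(
  ID = 1:6,
  group = c("A", "B", "C", "A", "B", "C"),
  var1 = c(1, 3, 5, 7, 9, 11),
  var2 = c(2, 4, 6, 8, 10, 12)
)

groups <- sort(unique(df$group))
vars   <- c("var1", "var2")

# Second header row: "ID" followed by var1/var2 repeated once per group
col_labels <- c("ID", rep(vars, times = length(groups)))

# First header row: a blank cell over ID, then one spanning cell per group
header <- c(" " = 1,
            setNames(rep(length(vars), length(groups)),
                     paste0("group", groups)))

col_labels
header
```

These would then be passed as kbl(col.names = col_labels) and add_header_above(header).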
Using reshape2
library(reshape2)
dcast(
melt(
df,
id.vars=c("ID","group"),
measure.vars=c("var1","var2")
),
ID~group+variable,
value.var="value"
)
ID A_var1 A_var2 B_var1 B_var2 C_var1 C_var2
1 1 1 2 NA NA NA NA
2 2 NA NA 3 4 NA NA
3 3 NA NA NA NA 5 6
4 4 7 8 NA NA NA NA
5 5 NA NA 9 10 NA NA
6 6 NA NA NA NA 11 12
There are two separate issues here. One is how to print a hierarchical table in R. There are a few ways to do this, mostly producing LaTeX or HTML tables. To print a hierarchical table in the R console, one option is to use tabular from the tables package:
library(tables)
library(dplyr)
fm <- function(x) if(length(x) == 0) "" else x
tabular( (ID) ~ group*(var1 + var2)*(`---`=fm),
data=mutate(df, ID = factor(ID), group = factor(group)))
#>
#> group
#> A B C
#> var1 var2 var1 var2 var1 var2
#> ID --- --- --- --- --- ---
#> 1 1 2
#> 2 3 4
#> 3 5 6
#> 4 7 8
#> 5 9 10
#> 6 11 12
The second, perhaps more important issue is how to store and work with hierarchical tabular structures. This is possible with nested tibbles. In your case, we can do something like:
library(tidyr)
nested_df <- complete(df, ID, group) %>%
nest_by(ID, group) %>%
pivot_wider(names_from = group, values_from = data)
nested_df
#> # A tibble: 6 x 4
#> ID A B C
#> <int> <list<tibble[,2]>> <list<tibble[,2]>> <list<tibble[,2]>>
#> 1 1 [1 x 2] [1 x 2] [1 x 2]
#> 2 2 [1 x 2] [1 x 2] [1 x 2]
#> 3 3 [1 x 2] [1 x 2] [1 x 2]
#> 4 4 [1 x 2] [1 x 2] [1 x 2]
#> 5 5 [1 x 2] [1 x 2] [1 x 2]
#> 6 6 [1 x 2] [1 x 2] [1 x 2]
To access, say, the var1 and var2 columns for group A we would do:
nested_df %>% select(A) %>% unnest(A)
# A tibble: 6 x 2
var1 var2
<dbl> <dbl>
1 1 2
2 NA NA
3 NA NA
4 7 8
5 NA NA
6 NA NA
Created on 2022-05-25 by the reprex package (v2.0.1)

Expand data frame and add a new variable

I have a data frame structured like this:
+----------+------+--------+-------+
| Location | year | group1 | Value |
+----------+------+--------+-------+
| a | 2020 | 1 | x |
| a | 2020 | 2 | y |
| a | 2020 | 3 | z |
| a | 2021 | 1 | x |
| a | 2021 | 2 | y |
| a | 2021 | 3 | z |
| b | 2020 | 1 | x |
| b | 2020 | 2 | y |
| b | 2020 | 3 | z |
+----------+------+--------+-------+
I would like to expand the data frame to include 3 rows for every location, year, and group1 combination and generate a group2 variable that identifies these new combinations (1-3). Ideally, the data frame will look like this:
+----------+------+--------+-------+--------+
| Location | year | group1 | Value | group2 |
+----------+------+--------+-------+--------+
| a | 2020 | 1 | x | 1 |
| a | 2020 | 1 | x | 2 |
| a | 2020 | 1 | x | 3 |
| a | 2020 | 2 | y | 1 |
| a | 2020 | 2 | y | 2 |
| a | 2020 | 2 | y | 3 |
| ... | ... |... |... |... |
+----------+------+--------+-------+--------+
I was able to expand the dataframe to the correct number of total rows using the following code:
df[rep(seq_len(nrow(df)),3), 1:4]
But couldn't figure out how to add the group2 variable shown above.
With tidyr you can use expand - this will expand your data frame to all combinations of values with your sequence of 1 to 3:
library(tidyverse)
df %>%
group_by(Location, year, group1, Value) %>%
expand(group2 = 1:3)
Output
Location year group1 Value group2
<fct> <dbl> <int> <fct> <int>
1 a 2020 1 x 1
2 a 2020 1 x 2
3 a 2020 1 x 3
4 a 2020 2 y 1
5 a 2020 2 y 2
6 a 2020 2 y 3
...
Your approach looks close, and I suppose you could just add on group2 like this:
cbind(df[rep(seq_len(nrow(df)), each = 3), ], group2 = 1:3)
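The difference from the original attempt is each = 3: it repeats every row three times in a row, so a 1:3 counter lines up with the copies. A small sketch on a shortened version of the data (the three-row df and the name k are made up for illustration), with the counter written out explicitly instead of relying on recycling:

```r
df <- data.frame(
  Location = c("a", "a", "b"),
  year     = c(2020, 2020, 2020),
  group1   = c(1, 2, 1),
  Value    = c("x", "y", "x")
)

k <- 3                                   # copies per original row
idx <- rep(seq_len(nrow(df)), each = k)  # 1 1 1 2 2 2 3 3 3

out <- cbind(df[idx, ], group2 = rep(seq_len(k), times = nrow(df)))
rownames(out) <- NULL
out
```

Without each, rep() would cycle through the whole data frame three times, and the rows would need sorting before a 1:3 counter matched up.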
Here is the solution you are looking for
library(dplyr)
# 1. Data set
df <- data.frame(
location = c("a","a","a","a","a","a","b","b","b"),
year = c(2020,2020,2020,2021,2021,2021,2020,2020,2020),
group1 = c(1,2,3,1,2,3,1,2,3),
value = c("x","y","z","x","y","z","x","y","z"),
stringsAsFactors = FALSE)
# 2. Your code to expand data frame
df <- df[rep(seq_len(nrow(df)), 3), 1:4]
# 3. Arrange
df <- df %>% arrange(location, year, group1, value)
# 4. Add 'group2'
df <- df %>%
group_by(location, year, group1, value) %>%
mutate(group2 = cumsum(group1) / group1) %>%
arrange(location, year, group1, value, group2)
Hope it works
We can use crossing from tidyr
library(tidyr)
library(dplyr)
crossing(df1, group2 = 1:3)
# A tibble: 27 x 5
# Location year group1 Value group2
# <chr> <int> <int> <chr> <int>
# 1 a 2020 1 x 1
# 2 a 2020 1 x 2
# 3 a 2020 1 x 3
# 4 a 2020 2 y 1
# 5 a 2020 2 y 2
# 6 a 2020 2 y 3
# 7 a 2020 3 z 1
# 8 a 2020 3 z 2
# 9 a 2020 3 z 3
#10 a 2021 1 x 1
# … with 17 more rows
Or create a list column and then unnest
df1 %>%
mutate(group2 = list(1:3)) %>%
unnest(c(group2))
data
df1 <- structure(list(Location = c("a", "a", "a", "a", "a", "a", "b",
"b", "b"), year = c(2020L, 2020L, 2020L, 2021L, 2021L, 2021L,
2020L, 2020L, 2020L), group1 = c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L), Value = c("x", "y", "z", "x", "y", "z", "x", "y", "z"
)), class = "data.frame", row.names = c(NA, -9L))

How to assign a number between 1 and n in R to rows?

I would like to randomly assign the individuals in my data to a group numbered 1 through 3. How would I do this? (A dplyr solution is preferred.) Rows with the same id # belong to the same individual and must be in the same group.
_______________________
id # | group_id |
454452 | 1 |
5450441 | 2 |
5444531 | 3 |
5444531 | 3 |
5404501 | 1 |
5404041 | 2 |
5404041 | 2 |
254252 | 3 |
541254 | 2 |
_______________________
A simple solution might be:
df <- df %>% group_by(id) %>% mutate(group_id = sample(1:3,1))
which (using set.seed(12345)) resulted in:
id group_id
1 454452 3
2 5450441 1
3 5444531 2
4 5444531 2
5 5404501 2
6 5404041 3
7 5404041 3
8 254252 2
9 541254 2
Here's one option:
library(dplyr)
df <-
tibble(ids = c(100, 200, 200, 300, 300, 400))
distinct_ids <-
df %>%
select(ids) %>%
distinct() %>%
mutate(group_num = sample.int(3, size = nrow(.), replace = TRUE))
df %>%
left_join(distinct_ids, by = "ids")
# A tibble: 6 x 2
ids group_num
<dbl> <int>
1 100 3
2 200 1
3 200 1
4 300 3
5 300 3
6 400 2
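The idea in this answer (draw once per distinct id, then map the draw back to every row) can also be written without a join. The following base R sketch uses the ids from the question; the names uniq_ids and lookup are illustrative, and set.seed() is there only so the sketch is repeatable.

```r
set.seed(12345)  # only for repeatability

ids <- c(454452, 5450441, 5444531, 5444531, 5404501,
         5404041, 5404041, 254252, 541254)

uniq_ids <- unique(ids)

# One random draw from 1:3 per distinct id ...
lookup <- sample(1:3, length(uniq_ids), replace = TRUE)

# ... then map each row back to the draw for its id
group_id <- lookup[match(ids, uniq_ids)]

data.frame(id = ids, group_id = group_id)
```

If the groups should be of roughly equal size, one option is to shuffle a balanced vector instead, e.g. lookup <- sample(rep_len(1:3, length(uniq_ids))).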
In base R we could sample the factorized "id" and display them as.numeric.
set.seed(42) # for sake of reproducibility
dat <- transform(dat, group_id=as.numeric(factor(id, levels=sample(unique(dat$id)))))
dat
# id X1 X2 X3 group_id
# 1 454452 -1.1045994 0.0356312 1.93557177 1
# 2 5450441 0.5390238 1.3149588 1.72323080 5
# 3 5444531 0.5802063 0.9781675 0.35840206 6
# 4 5444531 -0.6575028 0.8817912 0.30243092 6
# 5 5404501 1.5548955 0.4822047 -0.39411451 7
# 6 5404041 -1.1876414 0.9657529 0.78814062 2
# 7 5404041 0.1518129 -0.8145709 0.67070383 2
# 8 254252 -1.0861326 0.2839578 -0.94918081 4
# 9 541254 1.6133728 -0.1616986 0.03613574 3
Data
dat <- structure(list(id = c(454452L, 5450441L, 5444531L, 5444531L,
5404501L, 5404041L, 5404041L, 254252L, 541254L), X1 = c(-1.10459944068306,
0.539023801893912, 0.580206320853481, -0.657502835154674, 1.55489554810057,
-1.18764140164182, 0.151812914504533, -1.08613257605253, 1.61337280035418
), X2 = c(0.0356311982051355, 1.31495884897891, 0.978167526364279,
0.881791226863203, 0.482204688262918, 0.965752878105794, -0.814570938270238,
0.283957806364306, -0.161698647607024), X3 = c(1.93557176599585,
1.72323079854894, 0.358402056802064, 0.3024309248682, -0.394114506412192,
0.788140622823556, 0.67070382675052, -0.949180809687611, 0.0361357384849679
)), class = "data.frame", row.names = c(NA, -9L))

Add Previous Row to Corresponding Column by Group in R

I will post a reproducible Example.
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
I want something like this as end result.
+====+========+========+
| id | group1 | group2 |
+====+========+========+
| 1 | a | b |
+----+--------+--------+
| 1 | b | c |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
| 2 | a | b |
+----+--------+--------+
| 2 | b | - |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
Note that the order of the IDs matters. I have another column with a timestamp.
One solution with dplyr and rleid from data.table:
library(dplyr)
df %>%
mutate(id2 = data.table::rleid(id)) %>%
group_by(id2) %>%
mutate(group2 = lead(group))
# A tibble: 8 x 4
# Groups: id2 [3]
id group id2 group2
<dbl> <fct> <int> <fct>
1 1.00 a 1 b
2 1.00 b 1 c
3 1.00 c 1 d
4 1.00 d 1 NA
5 2.00 a 2 b
6 2.00 b 2 NA
7 1.00 c 3 d
8 1.00 d 3 NA
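rleid() labels each run of consecutive equal ids, which is why the second block of id 1 gets its own group. If pulling in data.table only for that is undesirable, the same run labels can be built with base R's rle(). This is a sketch of that substitution, not the answer's code; the column name run is illustrative.

```r
id    <- c(1, 1, 1, 1, 2, 2, 1, 1)
group <- c("a", "b", "c", "d", "a", "b", "c", "d")
df <- data.frame(id, group, stringsAsFactors = FALSE)

# Label each run of consecutive identical ids: 1 1 1 1 2 2 3 3
runs <- rle(df$id)
df$run <- rep(seq_along(runs$lengths), runs$lengths)

# Within each run, shift group up by one; the last row of a run gets NA
df$group2 <- ave(df$group, df$run, FUN = function(x) c(x[-1], NA))
df
```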
If I understood your question correctly, you can use the following function:
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
add_group2 <- function(df) {
  # shift group up by one row; the last row has no successor
  group2 <- c(as.character(df$group[-1]), "-")
  # a row whose next id differs has no successor either
  group2[which(c(diff(df$id), 0) != 0)] <- "-"
  data.frame(df, group2)
}
add_group2(df)
Result should be:
id group group2
1 1 a b
2 1 b c
3 1 c d
4 1 d -
5 2 a b
6 2 b -
7 1 c d
8 1 d -
