Merge rows with different values into a single row in R [duplicate]

I have a dataset that looks like this:
ID | age | disease
smith192 | 17 | lung_cancer
green484 | 12 | diabetes
green484 | 13 | heart_irregularities
tom584 | 12 | colon_cancer
tom584 | 14 | diabetes
tom584 | 15 | malnutrition
And I would like R to organize it into this:
ID | age_1 | disease_1 | age_2 | disease_2 | age_3 | disease_3 |
smith192 | 17 | lung_cancer | NA | NA | NA | NA |
green484 | 12 | diabetes | 13 | heart_irregularities | NA | NA |
tom584 | 12 | colon_cancer | 14 | diabetes | 15 | malnutrition |
Any help would be greatly appreciated!

You could create disease indices for each ID and then pivot the data to wide.
base
df |>
transform(n = ave(seq_along(ID), ID, FUN = seq_along)) |>
reshape(direction = "wide", idvar = "ID", timevar = "n", v.names = c("age", "disease"))
# ID age.1 disease.1 age.2 disease.2 age.3 disease.3
# 1 smith192 17 lung_cancer NA <NA> NA <NA>
# 2 green484 12 diabetes 13 heart_irregularities NA <NA>
# 4 tom584 12 colon_cancer 14 diabetes 15 malnutrition
tidyverse
library(dplyr)
library(tidyr)
df %>%
group_by(ID) %>%
mutate(n = 1:n()) %>%
ungroup() %>%
pivot_wider(id_cols = ID, names_from = n, values_from = c(age, disease))
# # A tibble: 3 × 7
# ID age_1 age_2 age_3 disease_1 disease_2 disease_3
# <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
# 1 smith192 17 NA NA lung_cancer NA NA
# 2 green484 12 13 NA diabetes heart_irregularities NA
# 3 tom584 12 14 15 colon_cancer diabetes malnutrition
Data
df <- structure(list(ID = c("smith192", "green484", "green484", "tom584",
"tom584", "tom584"), age = c(17, 12, 13, 12, 14, 15), disease = c("lung_cancer",
"diabetes", "heart_irregularities", "colon_cancer", "diabetes",
"malnutrition")), class = "data.frame", row.names = c(NA, -6L))
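The tidyverse output above groups all age columns before the disease columns, whereas the question asked for age_1, disease_1, age_2, and so on. A minimal sketch of one way to get that ordering, assuming tidyr 1.2.0 or later, where pivot_wider() has a names_vary argument (the object name wide is only for illustration):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c("smith192", "green484", "green484", "tom584", "tom584", "tom584"),
  age = c(17, 12, 13, 12, 14, 15),
  disease = c("lung_cancer", "diabetes", "heart_irregularities",
              "colon_cancer", "diabetes", "malnutrition")
)

wide <- df %>%
  group_by(ID) %>%
  mutate(n = row_number()) %>%   # running index 1, 2, 3, ... within each ID
  ungroup() %>%
  pivot_wider(
    id_cols = ID,
    names_from = n,
    values_from = c(age, disease),
    names_vary = "slowest"       # keep age_k and disease_k next to each other
  )

names(wide)
```

With the default names_vary = "fastest", the columns come out grouped by variable, as in the answer's printed tibble.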

Related

R: reshape from "wide" to "long", keeping some variables "wide"

I have a data file in wide format, with a set of recurring variables (var1 and var2 below).
data have:
| ID | background vars| var1.A | var2.A | var1.B | var2.B | var1.C | var2.C |
| -: | :------------- |:------:|:------:|:------:|:------:|:------:|:------:|
| 1 | data1 | 1 | 2 | 3 | 4 | 5 | 6 |
| 2 | data2 | 7 | 8 | 9 | 10 | 11 | 12 |
I need to reshape it "halfway" into long format, i.e. keep each var group together (wide) and put each recurrence on a different line (long).
data want:
| ID | background vars | recurrence | var1 | var2 |
| -: | :-------------- |:----------:|:------:|:------:|
| 1 | data1 | A | 1 | 2 |
| 1 | data1 | B | 3 | 4 |
| 1 | data1 | C | 5 | 6 |
| 2 | data2 | A | 7 | 8 |
| 2 | data2 | B | 9 | 10 |
| 2 | data2 | C | 11 | 12 |
I found some solutions for this using reshape(), gather() and melt().
However, all of these collapse ALL variables to long format and do not allow some variables to be kept "wide".
How can data be shaped this way using R?
Use the keyword '.value' in the names_to argument to keep that part of the column name in wide format:
tidyr::pivot_longer(df, c(-ID, -`background vars`),
names_sep = '\\.',
names_to = c('.value', 'recurrence'))
#> # A tibble: 6 x 5
#> ID `background vars` recurrence var1 var2
#> <int> <chr> <chr> <int> <int>
#> 1 1 data1 A 1 2
#> 2 1 data1 B 3 4
#> 3 1 data1 C 5 6
#> 4 2 data2 A 7 8
#> 5 2 data2 B 9 10
#> 6 2 data2 C 11 12
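For comparison, base R's reshape() can also do this "halfway" reshape when the column names follow a name.suffix pattern. The following is a sketch rather than part of the answer above; it assumes the background column is named background.vars (a syntactic name) and that only the six var columns are passed as varying:

```r
df <- data.frame(
  ID = 1:2,
  background.vars = c("data1", "data2"),
  var1.A = c(1L, 7L), var2.A = c(2L, 8L),
  var1.B = c(3L, 9L), var2.B = c(4L, 10L),
  var1.C = c(5L, 11L), var2.C = c(6L, 12L)
)

long <- reshape(
  df,
  direction = "long",
  varying   = names(df)[3:8],  # the recurring columns var1.A ... var2.C
  sep       = ".",             # split each name into variable and recurrence
  timevar   = "recurrence",    # column that receives A, B, C
  idvar     = "ID"
)

# reshape() stacks by recurrence first, so sort by ID for readability
long <- long[order(long$ID, long$recurrence), ]
rownames(long) <- NULL
long
```

Because sep splits each name into a variable part and a suffix, var1 and var2 stay as separate columns and only the suffix is stacked.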
If you need your code to be easily readable and you feel that ".value" in @Allan's example is a little opaque, you might consider a two-step pivot: simply pivot_longer() and then immediately pivot_wider() with different parameters:
df <- structure(
list(
ID = 1:2,
background.vars = c("data1", "data2"),
var1.A = c(1L, 7L),
var2.A = c(2L, 8L),
var1.B = c(3L, 9L),
var2.B = c(4L, 10L),
var1.C = c(5L, 11L),
var2.C = c(6L, 12L)),
class = "data.frame",
row.names = c(NA, -2L)
)
require(tidyr)
#> Loading required package: tidyr
long.df <-
pivot_longer(df,
c(-ID, -`background.vars`), #lengthen all columns but these
names_sep = "\\.", #split column names wherever there is a '.'
names_to = c("var", "letter"))
long.df
#> # A tibble: 12 × 5
#> ID background.vars var letter value
#> <int> <chr> <chr> <chr> <int>
#> 1 1 data1 var1 A 1
#> 2 1 data1 var2 A 2
#> 3 1 data1 var1 B 3
#> 4 1 data1 var2 B 4
#> 5 1 data1 var1 C 5
#> 6 1 data1 var2 C 6
#> 7 2 data2 var1 A 7
#> 8 2 data2 var2 A 8
#> 9 2 data2 var1 B 9
#> 10 2 data2 var2 B 10
#> 11 2 data2 var1 C 11
#> 12 2 data2 var2 C 12
pivot_wider(long.df, names_from = "var")
#> # A tibble: 6 × 5
#> ID background.vars letter var1 var2
#> <int> <chr> <chr> <int> <int>
#> 1 1 data1 A 1 2
#> 2 1 data1 B 3 4
#> 3 1 data1 C 5 6
#> 4 2 data2 A 7 8
#> 5 2 data2 B 9 10
#> 6 2 data2 C 11 12
Created on 2022-05-24 by the reprex package (v2.0.1)
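To check that the two-step pivot and the single '.value' pivot agree, the two results can be compared directly. This is a small sketch using the same example data; the names one_step and two_step are only for illustration:

```r
library(tidyr)

df <- data.frame(
  ID = 1:2,
  background.vars = c("data1", "data2"),
  var1.A = c(1L, 7L), var2.A = c(2L, 8L),
  var1.B = c(3L, 9L), var2.B = c(4L, 10L),
  var1.C = c(5L, 11L), var2.C = c(6L, 12L)
)

# Single pivot: '.value' keeps var1/var2 as columns
one_step <- pivot_longer(
  df, -c(ID, background.vars),
  names_sep = "\\.",
  names_to = c(".value", "letter")
)

# Two pivots: fully long first, then spread 'var' back out
two_step <- pivot_wider(
  pivot_longer(
    df, -c(ID, background.vars),
    names_sep = "\\.",
    names_to = c("var", "letter")
  ),
  names_from = "var",
  values_from = "value"
)

all.equal(as.data.frame(one_step), as.data.frame(two_step))
```

On this data the comparison should report that the two results are equal, so the choice between the approaches is mainly one of readability.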

Create a pivot table with multiple hierarchical column groups

I'm trying to create a pivot table (to later be rendered in markdown). However, I can't find a way to produce multiple pivot columns.
my data:
| ID | group | var1 | var2 |
| -: |:-----:|:------:|:------:|
| 1 | A | 1 | 2 |
| 2 | B | 3 | 4 |
| 3 | C | 5 | 6 |
| 4 | A | 7 | 8 |
| 5 | B | 9 | 10 |
| 6 | C | 11 | 12 |
required table:
| | groupA | groupB | groupC |
| ID | var1 | var2 | var1 | var2 | var1 | var2 |
| -: |:------:|:------:|:------:|:------:|:------:|:------:|
| 1 | 1 | 2 | | | | |
| 2 | | | 3 | 4 | | |
| 3 | | | | | 5 | 6 |
| 4 | 7 | 8 | | | | |
| 5 | | | 9 | 10 | | |
| 6 | | | | | 11 | 12 |
Obviously the result is not a dataframe or a tibble.
How can such a table be created?
If this is your example data df:
df <- structure(list(ID = 1:6, group = c("A", "B", "C", "A", "B", "C"
), var1 = c(1, 3, 5, 7, 9, 11), var2 = c(2, 4, 6, 8, 10, 12)), class = "data.frame", row.names = c(NA,
-6L))
... you can generate the table structure and column headers like this:
library(tidyr)
df %>%
pivot_longer(cols = starts_with('var'),
names_to = 'var_name',
values_to = 'value'
) %>%
pivot_wider(id_cols = ID,
names_from = c('group', 'var_name'),
names_sep = '\n', ## wrap line after group name
values_from = 'value'
)
Note that AFAIK having the group names span the variable columns would require some separate fiddling between the steps of reshaping your data (see above) and producing the markdown.
Building on @I_O's data transformation, you can add the group header row with the kableExtra package, e.g.:
library(dplyr)
library(tidyr)
library(kableExtra)
options(knitr.kable.NA = '')
df %>%
pivot_longer(cols = starts_with('var'),
names_to = 'var_name',
values_to = 'value'
) %>% pivot_wider(id_cols = ID,
names_from = c('group', 'var_name'),
names_sep = '\n', ## wrap line after group name
values_from = 'value'
) %>%
kbl(col.names = c("ID", "var1", "var2","var1", "var2","var1", "var2")) %>%
add_header_above(c(" ", "groupA" = 2,"groupB" = 2,"groupC" = 2 )) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
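The col.names and add_header_above() arguments above are typed out by hand. As a sketch, they could instead be derived from the reshaped column names so that they follow the data; this assumes the wide columns are named with "_" as the separator (instead of '\n'), and the names wide, header and col_labels are only for illustration:

```r
library(tidyr)

df <- data.frame(
  ID = 1:6,
  group = c("A", "B", "C", "A", "B", "C"),
  var1 = c(1, 3, 5, 7, 9, 11),
  var2 = c(2, 4, 6, 8, 10, 12)
)

wide <- df %>%
  pivot_longer(cols = starts_with("var"),
               names_to = "var_name", values_to = "value") %>%
  pivot_wider(id_cols = ID,
              names_from = c(group, var_name),
              names_sep = "_",
              values_from = value)

# Group part of each column name, e.g. "A" from "A_var1"
groups <- sub("_.*$", "", names(wide)[-1])
runs <- rle(groups)

# Named vector of column spans: one blank cell over ID, then one entry per group
header <- c(" " = 1, setNames(runs$lengths, paste0("group", runs$values)))

# Second header row: just the variable part of each name
col_labels <- c("ID", sub("^[^_]*_", "", names(wide)[-1]))

# These could then be passed on as
#   kbl(wide, col.names = col_labels) %>% add_header_above(header)
header
col_labels
```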
Using reshape2
library(reshape2)
dcast(
melt(
df,
id.vars=c("ID","group"),
measure.vars=c("var1","var2")
),
ID~group+variable,
value.var="value"
)
ID A_var1 A_var2 B_var1 B_var2 C_var1 C_var2
1 1 1 2 NA NA NA NA
2 2 NA NA 3 4 NA NA
3 3 NA NA NA NA 5 6
4 4 7 8 NA NA NA NA
5 5 NA NA 9 10 NA NA
6 6 NA NA NA NA 11 12
There are two separate issues here. One is how to print a hierarchical table in R. There are a few ways to do this, mostly producing LaTeX or HTML tables. For a hierarchical table printed in the R console, one option is to use tabular from the tables package:
library(tables)
library(dplyr)
fm <- function(x) if(length(x) == 0) "" else x
tabular( (ID) ~ group*(var1 + var2)*(`---`=fm),
data=mutate(df, ID = factor(ID), group = factor(group)))
#>
#> group
#> A B C
#> var1 var2 var1 var2 var1 var2
#> ID --- --- --- --- --- ---
#> 1 1 2
#> 2 3 4
#> 3 5 6
#> 4 7 8
#> 5 9 10
#> 6 11 12
The second, perhaps more important issue is how to store and work with hierarchical tabular structures. This is possible with nested tibbles. In your case, we can do something like:
library(tidyr)
nested_df <- complete(df, ID, group) %>%
nest_by(ID, group) %>%
pivot_wider(names_from = group, values_from = data)
nested_df
#> # A tibble: 6 x 4
#> ID A B C
#> <int> <list<tibble[,2]>> <list<tibble[,2]>> <list<tibble[,2]>>
#> 1 1 [1 x 2] [1 x 2] [1 x 2]
#> 2 2 [1 x 2] [1 x 2] [1 x 2]
#> 3 3 [1 x 2] [1 x 2] [1 x 2]
#> 4 4 [1 x 2] [1 x 2] [1 x 2]
#> 5 5 [1 x 2] [1 x 2] [1 x 2]
#> 6 6 [1 x 2] [1 x 2] [1 x 2]
To access, say, the var1 and var2 columns for group A we would do:
nested_df %>% select(A) %>% unnest(A)
# A tibble: 6 x 2
var1 var2
<dbl> <dbl>
1 1 2
2 NA NA
3 NA NA
4 7 8
5 NA NA
6 NA NA
Created on 2022-05-25 by the reprex package (v2.0.1)

Collapse two dataframes and make an array structure

data_1 <- data.frame(V1 = c("123","345","546","890"), V2 = c("J10","K12","R34","J17"),V3=c("N12","M34","W57","Q90"))
data_1
| V1 | V2 | V3 |
|:---- |:------:| -----:|
| 123 | J10 | N12 |
| 345 | K12 | M34 |
| 546 | N12 | R34 |
| 890 | J17 | J10 |
data_2 <- data.frame(V1 = c("123","345","546","890"), V2 = c("01/02/90","10/04/21","09/03/95","29/03/90"),V3=c("28/07/86","16/02/87","17/10/56","14/01/60"))
data_2
| V1 | V2 | V3 |
|:---- |:------:| -----:|
| 123 | 01/02/90 | 28/07/86 |
| 345 | 10/04/21 | 16/02/87 |
| 546 | 09/03/95 | 17/10/56 |
| 890 | 29/03/90 | 14/01/60 |
I would like to have a common first column and collapse the data into an array structure.
Result:
| V1 | J10 | N12 | K12 | M34 | R34 | J17 |
|:---- |:--------:|:--------:|:--------:|:--------:|:--------:| --------:|
| 123 | 01/02/90 | 28/07/86 | | | | |
| 345 | | | 10/04/21 | 16/02/87 | | |
| 546 | | 09/03/95 | | | 17/10/56 | |
| 890 | 14/01/60 | | | | | 29/03/90 |
We may reshape to 'long' format, bind the datasets and then reshape back to 'wide'
library(dplyr)
library(tidyr)
bind_cols(data_1 %>%
pivot_longer(cols = -V1),
data_2 %>%
pivot_longer(cols = -V1) %>%
select(-V1)) %>%
select(-starts_with('name')) %>%
pivot_wider(names_from = value...3, values_from = value...5)
-output
# A tibble: 4 × 9
V1 J10 N12 K12 M34 R34 W57 J17 Q90
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 123 01/02/90 28/07/86 <NA> <NA> <NA> <NA> <NA> <NA>
2 345 <NA> <NA> 10/04/21 16/02/87 <NA> <NA> <NA> <NA>
3 546 <NA> <NA> <NA> <NA> 09/03/95 17/10/56 <NA> <NA>
4 890 <NA> <NA> <NA> <NA> <NA> <NA> 29/03/90 14/01/60
library(dplyr)
library(tidyr)
data_1 <- data.frame(V1 = c("123","345","546","890"), V2 = c("J10","K12","R34","J17"),V3=c("N12","M34","W57","Q90"))
data_2 <- data.frame(V1 = c("123","345","546","890"), V2 = c("01/02/90","10/04/21","09/03/95","29/03/90"),V3=c("28/07/86","16/02/87","17/10/56","14/01/60"))
# Pair each code column of data_1 with the date column in the same position of data_2
# (assumes both data frames list V1 in the same row order)
pairs_1 <- data.frame( V1= data_1$V1, VAR = data_1$V2, DATE = data_2$V2, stringsAsFactors = F)
pairs_2 <- data.frame( V1= data_1$V1, VAR = data_1$V3, DATE = data_2$V3, stringsAsFactors = F)
result <- bind_rows(pairs_1, pairs_2) %>%
pivot_wider(names_from = VAR, values_from = DATE)
result
V1 J10 K12 R34 J17 N12 M34 W57 Q90
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 123 01/02/90 NA NA NA 28/07/86 NA NA NA
2 345 NA 10/04/21 NA NA NA 16/02/87 NA NA
3 546 NA NA 09/03/95 NA NA NA 17/10/56 NA
4 890 NA NA NA 29/03/90 NA NA NA 14/01/60
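If the two data frames have more than two value columns, writing out one helper data frame per column gets tedious. A minimal sketch of a position-based pairing that scales to any number of columns, assuming data_1 and data_2 have the same shape and list V1 in the same row order (the name long is only for illustration):

```r
library(tidyr)

data_1 <- data.frame(V1 = c("123", "345", "546", "890"),
                     V2 = c("J10", "K12", "R34", "J17"),
                     V3 = c("N12", "M34", "W57", "Q90"))
data_2 <- data.frame(V1 = c("123", "345", "546", "890"),
                     V2 = c("01/02/90", "10/04/21", "09/03/95", "29/03/90"),
                     V3 = c("28/07/86", "16/02/87", "17/10/56", "14/01/60"))

# unlist() stacks the columns one after another, so the codes in data_1 and
# the dates in data_2 stay aligned cell by cell
long <- data.frame(
  V1   = rep(data_1$V1, times = ncol(data_1) - 1),
  VAR  = unlist(data_1[-1], use.names = FALSE),
  DATE = unlist(data_2[-1], use.names = FALSE)
)

result <- pivot_wider(long, names_from = VAR, values_from = DATE)
result
```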

Aggregate rows into new column based on common value in another column in R

I have two data frames
df1 is like this
| NOC | 2007 | 2008 |
|:---- |:------:| -----:|
| A | 100 | 5 |
| B | 100 | 5 |
| C | 100 | 5|
| D | 20 | 2 |
| E | 10 | 12 |
| F | 2 | 1 |
df2
| NOC | GROUP |
|:---- |:------:|
| A | aa|
| B | aa |
| C | aa |
| D | bb |
| E | bb |
| F | cc |
I would like to create a new df3 which will aggregate the columns 2007 and 2008 based on Group identity by assigning the sum of rows with the same group identity, so my df3 would look like this
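| NOC | 2007 | 2008 | GROUP | s2007 | s2008 |
|:---- |:------:|:------:|:------:|:------:| -----:|
| A | 100 | 5 | aa | 300 | 15 |
| B | 100 | 5 | aa | 300 | 15 |
| C | 100 | 5 | aa | 300 | 15 |
| D | 20 | 2 | bb | 30 | 14 |
| E | 10 | 12 | bb | 30 | 14 |
| F | 2 | 1 | cc | 2 | 1 |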
My code is not very efficient. I first merged df1 with df2 by NOC into df3:
df3<-merge(df1, df2, by="NOC",all.x=TRUE)
Then I used dplyr to summarise into df4 and created s2007 and s2008:
df3 %>%
group_by(GROUP) %>%
summarise(num = n(),
s2007 = sum(`2007`), s2008 = sum(`2008`)) -> df3
Then I merged df1 with df3 again to create my final data set.
I have two questions:
Is there a more efficient way?
My data frame contains annual data for 2007-2030, and I am currently writing out the summarise call for each year. Is there a faster way to summarise all the columns except NOC?
Thank you!
First, a small piece of advice: avoid purely numeric column names such as 2007. They are non-syntactic in R and have to be wrapped in backticks every time you refer to them, which invites bugs.
library(tidyverse)
df1 %>% left_join(df2, by = 'NOC') %>%
group_by(GROUP) %>%
mutate(across(c(`2007`, `2008`), ~sum(.), .names = 's.{.col}' ))
# A tibble: 6 x 6
# Groups: GROUP [3]
NOC `2007` `2008` GROUP s.2007 s.2008
<chr> <int> <int> <chr> <int> <int>
1 A 100 5 aa 300 15
2 B 100 5 aa 300 15
3 C 100 5 aa 300 15
4 D 20 2 bb 30 14
5 E 10 12 bb 30 14
6 F 2 1 cc 2 1
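For the second part of the question (years 2007 to 2030 without typing each column), the same pattern can select the year columns by type instead of by name. A minimal sketch, assuming every numeric column in df1 is a year column and that the data are rebuilt here as plain data frames:

```r
library(dplyr)

df1 <- data.frame(
  NOC = c("A", "B", "C", "D", "E", "F"),
  `2007` = c(100, 100, 100, 20, 10, 2),
  `2008` = c(5, 5, 5, 2, 12, 1),
  check.names = FALSE          # keep the numeric column names as they are
)
df2 <- data.frame(
  NOC = c("A", "B", "C", "D", "E", "F"),
  GROUP = c("aa", "aa", "aa", "bb", "bb", "cc")
)

df3 <- df1 %>%
  left_join(df2, by = "NOC") %>%
  group_by(GROUP) %>%
  # sum every numeric column within each GROUP and prefix the new names with "s"
  mutate(across(where(is.numeric), ~ sum(.x), .names = "s{.col}")) %>%
  ungroup()

df3
```

Adding more year columns to df1 requires no change to the mutate() call, since the selection is by type rather than by name.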

Expand data frame and add a new variable

I have a data frame structured like this:
+----------+------+--------+-------+
| Location | year | group1 | Value |
+----------+------+--------+-------+
| a | 2020 | 1 | x |
| a | 2020 | 2 | y |
| a | 2020 | 3 | z |
| a | 2021 | 1 | x |
| a | 2021 | 2 | y |
| a | 2021 | 3 | z |
| b | 2020 | 1 | x |
| b | 2020 | 2 | y |
| b | 2020 | 3 | z |
+----------+------+--------+-------+
I would like to expand the data frame to include 3 rows for every location, year, and group1 combination and generate a group2 variable that identifies these new combinations (1-3). Ideally, the data frame will look like this:
+----------+------+--------+-------+--------+
| Location | year | group1 | Value | group2 |
+----------+------+--------+-------+--------+
| a | 2020 | 1 | x | 1 |
| a | 2020 | 1 | x | 2 |
| a | 2020 | 1 | x | 3 |
| a | 2020 | 2 | y | 1 |
| a | 2020 | 2 | y | 2 |
| a | 2020 | 2 | y | 3 |
| ... | ... |... |... |... |
+----------+------+--------+-------+--------+
I was able to expand the dataframe to the correct number of total rows using the following code:
df[rep(seq_len(nrow(df)),3), 1:4]
But couldn't figure out how to add the group2 variable shown above.
With tidyr you can use expand - this will expand your data frame to all combinations of values with your sequence of 1 to 3:
library(tidyverse)
df %>%
group_by(Location, year, group1, Value) %>%
expand(group2 = 1:3)
Output
Location year group1 Value group2
<fct> <dbl> <int> <fct> <int>
1 a 2020 1 x 1
2 a 2020 1 x 2
3 a 2020 1 x 3
4 a 2020 2 y 1
5 a 2020 2 y 2
6 a 2020 2 y 3
...
Your approach looks close, and I suppose you could just add on group2 like this:
cbind(df[rep(seq_len(nrow(df)), each = 3), ], group2 = 1:3)
Here is the solution you are looking for
library(dplyr)
# 1. Data set
df <- data.frame(
location = c("a","a","a","a","a","a","b","b","b"),
year = c(2020,2020,2020,2021,2021,2021,2020,2020,2020),
group1 = c(1,2,3,1,2,3,1,2,3),
value = c("x","y","z","x","y","z","x","y","z"),
stringsAsFactors = FALSE)
# 2. Your code to expand data frame
df <- df[rep(seq_len(nrow(df)), 3), 1:4]
# 3. Arrange
df <- df %>% arrange(location, year, group1, value)
# 4. Add 'group2'
df <- df %>%
group_by(location, year, group1, value) %>%
mutate(group2 = row_number()) %>%
arrange(location, year, group1, value, group2)
Hope it works
We can use crossing from tidyr
library(tidyr)
library(dplyr)
crossing(df1, group2 = 1:3)
# A tibble: 27 x 5
# Location year group1 Value group2
# <chr> <int> <int> <chr> <int>
# 1 a 2020 1 x 1
# 2 a 2020 1 x 2
# 3 a 2020 1 x 3
# 4 a 2020 2 y 1
# 5 a 2020 2 y 2
# 6 a 2020 2 y 3
# 7 a 2020 3 z 1
# 8 a 2020 3 z 2
# 9 a 2020 3 z 3
#10 a 2021 1 x 1
# … with 17 more rows
Or create a list column and then unnest
df1 %>%
mutate(group2 = list(1:3)) %>%
unnest(c(group2))
data
df1 <- structure(list(Location = c("a", "a", "a", "a", "a", "a", "b",
"b", "b"), year = c(2020L, 2020L, 2020L, 2021L, 2021L, 2021L,
2020L, 2020L, 2020L), group1 = c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L), Value = c("x", "y", "z", "x", "y", "z", "x", "y", "z"
)), class = "data.frame", row.names = c(NA, -9L))
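The same expansion can be done without any packages. A sketch in base R, relying on merge() returning the Cartesian product of its inputs when by = NULL (the name expanded is only for illustration):

```r
df1 <- data.frame(
  Location = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
  year = c(2020L, 2020L, 2020L, 2021L, 2021L, 2021L, 2020L, 2020L, 2020L),
  group1 = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  Value = c("x", "y", "z", "x", "y", "z", "x", "y", "z")
)

# Cross join: every row of df1 paired with group2 = 1, 2, 3
expanded <- merge(df1, data.frame(group2 = 1:3), by = NULL)

# Sort so the three copies of each original row sit together
expanded <- expanded[order(expanded$Location, expanded$year,
                           expanded$group1, expanded$group2), ]
rownames(expanded) <- NULL
head(expanded)
```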
