I have a data table (100 rows x 25 cols) that is structured like this:
ColA ColB ColC ColD
1: 1 3 1 2
2: 2 2 1 2
3: 3 1 1 2
I want to add column values together, in every possible combination.
The output would include, for example:
ColA+B ColA+C ColA+D ColB+C ColB+D etc.
BUT! I don't just want pairs. I am trying to get every combination. I also want to see, for example:
ColA+B+C ColA+B+D ColA+C+D ColB+C+D
And:
ColA+B+C+D
Ideally I could simply add all these combinations to the right of the base dataset (I am looking to run a correlation matrix on all of these combinations). I am far from an R expert. I see there are packages like combinat, but they don't seem to get at what I'm after. I would be very grateful indeed for any suggestions.
Thank you.
I'm hesitant to present this as a suggestion: it works with four columns, but as #DanAdams commented, this explodes with 25 columns:
choose(25,2) # 25 columns, 2 each
# [1] 300
choose(25,3) # 25 columns, 3 each
# [1] 2300
### 25 columns, in sets of 2 through 25 at a time
sum(sapply(2:25, choose, n=25))
# [1] 33554406
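For scale, here is a rough back-of-the-envelope sketch of the memory just the new columns would need with 100 rows of doubles (assuming 8 bytes per value and ignoring any data.frame overhead):
100 * sum(sapply(2:25, choose, n = 25)) * 8 / 1024^3  # rows * new columns * 8 bytes, in GiB
# roughly 25 GiB, before you even get to the correlation matrix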
But let's assume that you can cap the combination sizes you need. Change 2:4 to the sizes you want (here: pairs, triples, and all four columns at once).
combs <- do.call(c, lapply(2:4, function(z) asplit(combn(names(dat), z), 2)))
names(combs) <- sapply(combs, paste, collapse = "_")
length(combs)
# [1] 11
combs[c(1,2,10,11)]
# $ColA_ColB
# [1] "ColA" "ColB"
# $ColA_ColC
# [1] "ColA" "ColC"
# $ColB_ColC_ColD
# [1] "ColB" "ColC" "ColD"
# $ColA_ColB_ColC_ColD
# [1] "ColA" "ColB" "ColC" "ColD"
ign <- Map(function(cols, nm) dat[, (nm) := rowSums(.SD), .SDcols = cols], combs, names(combs))
dat[]
# ColA ColB ColC ColD ColA_ColB ColA_ColC ColA_ColD ColB_ColC ColB_ColD ColC_ColD ColA_ColB_ColC ColA_ColB_ColD ColA_ColC_ColD ColB_ColC_ColD ColA_ColB_ColC_ColD
# <int> <int> <int> <int> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: 1 3 1 2 4 2 3 4 5 3 5 6 4 6 7
# 2: 2 2 1 2 4 3 4 3 4 3 5 6 5 5 7
# 3: 3 1 1 2 4 4 5 2 3 3 5 6 6 4 7
BTW: I'm inferring that your data is of class data.table, ergo the side-effect I'm using here. If that's not the case, then this is base R:
dat <- cbind(dat, data.frame(lapply(combs, function(cols) rowSums(subset(dat, select = cols)))))
dat
# ColA ColB ColC ColD ColA_ColB ColA_ColC ColA_ColD ColB_ColC ColB_ColD ColC_ColD ColA_ColB_ColC ColA_ColB_ColD ColA_ColC_ColD ColB_ColC_ColD ColA_ColB_ColC_ColD
# 1 1 3 1 2 4 2 3 4 5 3 5 6 4 6 7
# 2 2 2 1 2 4 3 4 3 4 3 5 6 5 5 7
# 3 3 1 1 2 4 4 5 2 3 3 5 6 6 4 7
(Please don't blame me if your R crashes due to memory exhaustion. Save your work often.)
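Since the stated goal is a correlation matrix on all of these sums, once the columns exist that part is a one-liner. A sketch (note that in this toy example ColC and ColD are constant, so their correlations come back as NA with a warning; the real 100-row data presumably won't have that problem):
cor(dat)
# returns a 15 x 15 correlation matrix covering the 4 original + 11 combination columns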
Data
dat <- setDT(structure(list(ColA = 1:3, ColB = 3:1, ColC = c(1L, 1L, 1L), ColD = c(2L, 2L, 2L)), class = c("data.table", "data.frame"), row.names = c(NA, -3L)))
Related
I have a dataset of patient info with 25 DX codes and 20 procedure codes. Each code is in its own column, so my table is 45 columns wide just between the diagnosis and procedure codes. I need to reshape it so the DX codes end up in two columns (the DX number, 1-25, and the code) and the procedure codes likewise (the Proc number, 1-20, and the code).
Pared-down original data for ease of use:
MRN <- c(1,2,3,4)
DX1 <- c('12','14','16','m78.2')
DX2 <- c('m46.2', 'z98.0', 'z86.711', 'm10.6')
DX3 <- c('m10.7', 'Z86.711', 'M45.1', 'K21.9')
PROC1 <- c(06030, 06020, 06047, 22585)
PROC2 <- c(63020, 63030, 63047, 63030)
PROC3 <- c(22551, 22558, 22528, 22558)
spine_pt_3 <- as.data.frame(cbind(MRN, DX1, DX2, DX3, PROC1,PROC2, PROC3))
Code I attempted to get the data into the desired format:
spine3 <- melt(setDT(spine_pt_3),
id = 1,
measure1 = list(2:4),
measure2 = list (5:7),
Variable1= "DX",
variable2 = "Proc"
)
My goal is to get my data to look like this. I'm not sure if this is possible or where I'm going wrong.
data.table
melt(as.data.table(spine_pt_3), id.vars = "MRN",
measure.vars = patterns(DX = "^DX", PROC = "^PROC"),
variable.factor = FALSE)
# MRN variable DX PROC
# <char> <char> <char> <char>
# 1: 1 1 12 6030
# 2: 2 1 14 6020
# 3: 3 1 16 6047
# 4: 4 1 m78.2 22585
# 5: 1 2 m46.2 63020
# 6: 2 2 z98.0 63030
# 7: 3 2 z86.711 63047
# 8: 4 2 m10.6 63030
# 9: 1 3 m10.7 22551
# 10: 2 3 Z86.711 22558
# 11: 3 3 M45.1 22528
# 12: 4 3 K21.9 22558
tidyr
tidyr::pivot_longer(spine_pt_3, -MRN,
names_pattern = "(DX|PROC)(.*)",
names_to = c(".value", "codenum"))
# # A tibble: 12 x 4
# MRN codenum DX PROC
# <chr> <chr> <chr> <chr>
# 1 1 1 12 6030
# 2 1 2 m46.2 63020
# 3 1 3 m10.7 22551
# 4 2 1 14 6020
# 5 2 2 z98.0 63030
# 6 2 3 Z86.711 22558
# 7 3 1 16 6047
# 8 3 2 z86.711 63047
# 9 3 3 M45.1 22528
# 10 4 1 m78.2 22585
# 11 4 2 m10.6 63030
# 12 4 3 K21.9 22558
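One note on the example data: cbind() on a mix of numeric and character vectors returns a character matrix, so every column of spine_pt_3 (and therefore of both results above) is character. Also, R parses 06030 as the number 6030 before cbind() ever sees it, which is why the leading zeros are gone in the output. If the leading zeros matter, a sketch that stores the codes as character from the start:
spine_pt_3 <- data.frame(
  MRN = c(1, 2, 3, 4),
  DX1 = c("12", "14", "16", "m78.2"),
  DX2 = c("m46.2", "z98.0", "z86.711", "m10.6"),
  DX3 = c("m10.7", "Z86.711", "M45.1", "K21.9"),
  PROC1 = c("06030", "06020", "06047", "22585"),  # leading zeros preserved as text
  PROC2 = c("63020", "63030", "63047", "63030"),
  PROC3 = c("22551", "22558", "22528", "22558")
)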
I often find questions where people have somehow ended up with an unnamed list of unnamed character vectors and they want to bind them row-wise into a data.frame. Here is an example:
library(magrittr)
data <- cbind(LETTERS[1:3],1:3,4:6,7:9,c(12,15,18)) %>%
split(1:3) %>% unname
data
#[[1]]
#[1] "A" "1" "4" "7" "12"
#
#[[2]]
#[1] "B" "2" "5" "8" "15"
#
#[[3]]
#[1] "C" "3" "6" "9" "18"
One typical approach is with do.call from base R.
do.call(rbind, data) %>% as.data.frame
# V1 V2 V3 V4 V5
#1 A 1 4 7 12
#2 B 2 5 8 15
#3 C 3 6 9 18
Perhaps a less efficient approach is with Reduce from base R.
Reduce(rbind, data, init = NULL) %>% as.data.frame
# V1 V2 V3 V4 V5
#1 A 1 4 7 12
#2 B 2 5 8 15
#3 C 3 6 9 18
However, when we consider more modern packages such as dplyr or data.table, some of the approaches that might immediately come to mind don't work, because the vectors are unnamed or because the items aren't themselves lists or data frames.
library(dplyr)
bind_rows(data)
#Error: Argument 1 must have names
library(data.table)
rbindlist(data)
#Error in rbindlist(data) :
# Item 1 of input is not a data.frame, data.table or list
One approach might be to set_names on the vectors.
library(purrr)
map_df(data, ~set_names(.x, seq_along(.x)))
# A tibble: 3 x 5
# `1` `2` `3` `4` `5`
# <chr> <chr> <chr> <chr> <chr>
#1 A 1 4 7 12
#2 B 2 5 8 15
#3 C 3 6 9 18
However, this seems like more steps than it needs to be.
Therefore, my question is what is an efficient tidyverse or data.table approach to binding an unnamed list of unnamed character vectors into a data.frame row-wise?
Not entirely sure about efficiency, but a compact option using purrr and tibble could be:
map_dfc(purrr::transpose(data), ~ unlist(tibble(.)))
V1 V2 V3 V4 V5
<chr> <chr> <chr> <chr> <chr>
1 A 1 4 7 12
2 B 2 5 8 15
3 C 3 6 9 18
Edit
Use #sindri_baldur's approach: https://stackoverflow.com/a/61660119/8583393
A way with data.table, similar to what #tmfmnk showed
library(data.table)
as.data.table(transpose(data))
# V1 V2 V3 V4 V5
#1: A 1 4 7 12
#2: B 2 5 8 15
#3: C 3 6 9 18
library(data.table)
setDF(transpose(data))
V1 V2 V3 V4 V5
1 A 1 4 7 12
2 B 2 5 8 15
3 C 3 6 9 18
This seems rather compact. I believe this is what powers bind_rows() from dplyr, and therefore map_df() in purrr, so it should be fairly efficient.
library(vctrs)
vec_rbind(!!!data)
This gives a data.frame.
...1 ...2 ...3 ...4 ...5
1 A 1 4 7 12
2 B 2 5 8 15
3 C 3 6 9 18
Some Benchmarks
It seems like the .name_repair within the tidyverse methods is a severe bottleneck. I took a few fairly straightforward options that also seemed to run the quickest from the other posts (thanks H 1 and sindri_baldur).
library(microbenchmark)
library(tibble)  # for as_tibble_row()

microbenchmark(vctrs = vec_rbind(!!!data),
               dt    = rbindlist(lapply(data, as.list)),
               map   = map_df(data, as_tibble_row, .name_repair = "unique"),
               base  = as.data.frame(do.call(rbind, data)))
But if you first name the vectors (but not necessarily the list elements), you get a different story.
data2 <- modify(data, ~set_names(.x, seq(.x)))
microbenchmark(vctrs = vec_rbind(!!!data2),
               dt    = rbindlist(lapply(data2, as.list)),
               map   = map_df(data2, as_tibble_row),
               base  = as.data.frame(do.call(rbind, data2)))
In fact, you can fold the time needed to name the vectors into the vec_rbind() timing (and not into the others) and still see fairly high performance.
microbenchmark(vctrs = vec_rbind(!!!modify(data, ~ set_names(.x, seq(.x)))),
               dt    = setDF(data.table::transpose(data)),
               map   = map_df(data2, as_tibble_row),
               base  = as.data.frame(do.call(rbind, data)))
For what it's worth.
My approach would be to just turn the list entries into the expected type:
rbindlist(lapply(data, as.list))
# V1 V2 V3 V4 V5
# <char> <char> <char> <char> <char>
#1: A 1 4 7 12
#2: B 2 5 8 15
#3: C 3 6 9 18
If you want the data types adjusted from character to appropriate types, lapply can help here as well: the first lapply is called for every row (list element), the second for every column.
rbindlist(lapply(data, as.list))[, lapply(.SD, type.convert)]
V1 V2 V3 V4 V5
<fctr> <int> <int> <int> <int>
1: A 1 4 7 12
2: B 2 5 8 15
3: C 3 6 9 18
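Note that type.convert() turns the character column into a factor by default (hence the <fctr> above). If plain character is preferred, pass as.is = TRUE through lapply (a small variation on the same idea):
rbindlist(lapply(data, as.list))[, lapply(.SD, type.convert, as.is = TRUE)]
# V1 stays character, V2 through V5 become integer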
An option with unnest_wider
library(tibble)
library(tidyr)
library(stringr)
tibble(col = data) %>%
unnest_wider(c(col), names_repair = ~ str_c('value', seq_along(.)))
# A tibble: 3 x 5
# value1 value2 value3 value4 value5
# <chr> <chr> <chr> <chr> <chr>
#1 A 1 4 7 12
#2 B 2 5 8 15
#3 C 3 6 9 18
Here is a slight variation on tmfmnk's suggested approach using as_tibble_row() to convert the vectors into single row tibbles. It's also necessary to use the .name_repair argument:
library(purrr)
library(tibble)
map_df(data, as_tibble_row, .name_repair = ~paste0("value", seq(.x)))
# A tibble: 3 x 5
value1 value2 value3 value4 value5
<chr> <chr> <chr> <chr> <chr>
1 A 1 4 7 12
2 B 2 5 8 15
3 C 3 6 9 18
I think this could be added to an already complete set of very good answers to this question:
library(rlang) # Or purrr
data %>%
exec(rbind, !!!.) %>%
as_tibble() %>%
set_names(~ letters[seq_along(.)])
# A tibble: 3 x 5
a b c d e
<chr> <chr> <chr> <chr> <chr>
1 A 1 4 7 12
2 B 2 5 8 15
3 C 3 6 9 18
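For what exec() is doing here: exec(rbind, !!!data) splices the list elements into a single rbind() call, so this is essentially the do.call(rbind, data) base approach dressed up for the pipe; the as_tibble() and set_names() steps are the only tidyverse-specific parts.
identical(exec(rbind, !!!data), do.call(rbind, data))
# should be TRUE: both build the same rbind() call from the list elements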
I am trying to expand an existing dataset, which currently looks like this:
df <- tibble(
site = letters[1:3],
years = rep(4, 3),
tr = c(3, 6, 4)
)
tr is the total number of replicates for each site/year combination. I simply want to add in the replicates and later the response variable for each replicate. This was easy for a single site/year combination using the following function:
f <- function(site=NULL, years=NULL, t=NULL){
df <- tibble(
site = rep(site, each = t, times= years),
tr = rep(1:t, times = years),
year = rep(1:years, each = t)
)
df
}
# For one site:
f(site='a', years=4, t=3)
# Producing this:
# # A tibble: 12 x 3
# site tr year
# <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
# 10 a 1 4
# 11 a 2 4
# 12 a 3 4
How can the function be applied to each row of the input dataframe to produce the final dataframe? One of the apply functions in base R, or pmap_df() from the purrr package, would seem ideal, but being unfamiliar with how these functions work, all my efforts have only produced errors.
If we want to apply the same function f to each row, use pmap (here pmap_dfr, which row-binds the results):
library(purrr)
pmap_dfr(df, ~ f(..1, ..2, ..3))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Another option is condense() from the development version of dplyr:
library(dplyr)  # development version, for condense()
library(tidyr)
df %>%
group_by(rn = row_number()) %>%
condense(out = f(site, years, tr)) %>%
unnest(c(out))
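Note that condense() never made it into a released version of dplyr. With released dplyr, the same row-wise application can be written with group_modify() instead (a sketch under that assumption):
library(dplyr)
df %>%
  group_by(rn = row_number()) %>%                     # one group per input row
  group_modify(~ f(.x$site, .x$years, .x$tr)) %>%     # .x is the one-row subset
  ungroup() %>%
  select(-rn)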
Or in base R, we can also use do.call with Map
do.call(rbind, do.call(Map, c(f, unname(as.data.frame(df)))))
Well, in base R you could do:
do.call(rbind, do.call(Vectorize(f, SIMPLIFY = FALSE), unname(df)))
# A tibble: 52 x 3
site tr year
* <chr> <int> <int>
1 a 1 1
2 a 2 1
3 a 3 1
4 a 1 2
5 a 2 2
6 a 3 2
7 a 1 3
8 a 2 3
9 a 3 3
10 a 1 4
# ... with 42 more rows
# Another base R option: split by site, then expand each site's block of years and replicates
do.call(rbind, lapply(split(df, df$site), function(x) {
  with(x, data.frame(site,
                     years = rep(sequence(years), each = tr),
                     tr = rep(sequence(tr), years)))
}))
We can use Map to apply f to every value of site, years and tr.
do.call(rbind, Map(f, df$site, df$years, df$tr))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Akrun's answer worked well for me, so I modified it to make the function applied to each row of the dataframe a little more explicit:
df1 <- pmap_df(df, function(site, years, tr){
site = rep(site, each = tr, times=years)
year = rep(1:years, each = tr)
tr = rep(1:tr, times=years)
return(tibble(site, year, tr))
})
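A quick sanity check on the size of the expanded data: each site contributes years * tr rows, so the result should have sum(df$years * df$tr) = 4*3 + 4*6 + 4*4 = 52 rows, matching the 52 x 3 tibbles shown in the answers above.
nrow(df1) == sum(df$years * df$tr)
# TRUE (52 rows)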
Have:
> aDT <- data.table(ID = c(3,3,2,2,2,3), colA = c(5,5,4,4,4,5), colC = c(1:6))
> aDT
ID colA colC
1: 3 5 1
2: 3 5 2
3: 2 4 3
4: 2 4 4
5: 2 4 5
6: 3 5 6
Need:
> aDT <- data.table(ID = c(3,2,3), colA = c(5,4,5), colC = c(2,5,6))
> aDT
ID colA colC
1: 3 5 2
2: 2 4 5
3: 3 5 6
Tried:
> aDT[, .SD[.N], by = list(ID,colA)]
ID colA colC
1: 3 5 6
2: 2 4 5
As you can see, the result's not really what I need. How to fix it?
(btw, I would like to retain the same order)
You are not really grouping by ID and colA but by consecutive runs of them, and rleid() is made for exactly that:
aDT[aDT[, .I[.N], rleid(ID, colA)]$V1]
# ID colA colC
#1: 3 5 2
#2: 2 4 5
#3: 3 5 6
.I[.N] extracts the global row number of the last row for each group:
aDT[, .I[.N], rleid(ID, colA)]
# rleid V1
#1: 1 2
#2: 2 5
#3: 3 6
There are three groups in total; the row numbers of their last rows are 2, 5, and 6. Then use those row numbers to subset the original data table.
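An equivalent way to see it (a sketch using the same rleid() idea): the last row of each run is the one whose run id is not duplicated when scanning from the end, so duplicated(..., fromLast = TRUE) also works:
aDT[!duplicated(rleid(ID, colA), fromLast = TRUE)]
# ID colA colC
#1: 3 5 2
#2: 2 4 5
#3: 3 5 6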
I need to add a new row to each id group where key = "n" and value is total - (a + b).
x <- data_frame( id = c(1,1,1,2,2,2,2),
key = c("a","b","total","a","x","b","total"),
value = c(1,2,10,4,1,3,12) )
# A tibble: 7 × 3
id key value
<dbl> <chr> <dbl>
1 1 a 1
2 1 b 2
3 1 total 10
4 2 a 4
5 2 x 1
6 2 b 3
7 2 total 12
In this example, the new rows should be
1 n 7
2 n 5
I tried getting the a+b subtotal and joining that to the total count to get the difference, but after using nine dplyr verbs I seem to be going in the wrong direction. Thanks.
This isn't a join, it's just binding new rows on:
x %>% group_by(id) %>%
summarize(
value = sum(value[key == 'total']) - sum(value[key %in% c('a', 'b')]),
key = 'n'
) %>%
bind_rows(x) %>%
select(id, key, value) %>% # back to original column order
arrange(id, key) # and a sensible row order
# # A tibble: 9 × 3
# id key value
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 b 2
# 3 1 n 7
# 4 1 total 10
# 5 2 a 4
# 6 2 b 3
# 7 2 n 5
# 8 2 total 12
# 9 2 x 1
Here's a way using data.table, binding rows as in Gregor's answer:
library(data.table)
setDT(x)
dcast(x, id ~ key)[, .(id, key = "n", value = total - a - b)][, rbind(.SD, x)][order(id)]
id key value
1: 1 n 7
2: 1 a 1
3: 1 b 2
4: 1 total 10
5: 2 n 5
6: 2 a 4
7: 2 x 1
8: 2 b 3
9: 2 total 12