I have a dataset of patient info with 25 DX (diagnosis) codes and 20 procedure codes. Each code is in its own column, so my table is 45 columns wide between diagnosis and procedure codes alone. I need to reshape the DX codes into two columns, one holding the position (DX1-DX25) and one holding the code, and likewise reshape the procedure codes into a position column (PROC1-PROC20) and a code column.
Pared-down version of the original data, for ease of use:
MRN <- c(1,2,3,4)
DX1 <- c('12','14','16','m78.2')
DX2 <- c('m46.2', 'z98.0', 'z86.711', 'm10.6')
DX3 <- c('m10.7', 'Z86.711', 'M45.1', 'K21.9')
PROC1 <- c(06030, 06020, 06047, 22585)
PROC2 <- c(63020, 63030, 63047, 63030)
PROC3 <- c(22551, 22558, 22528, 22558)
spine_pt_3 <- as.data.frame(cbind(MRN, DX1, DX2, DX3, PROC1,PROC2, PROC3))
Code attempted to get data in desired format
spine3 <- melt(setDT(spine_pt_3),
id = 1,
measure1 = list(2:4),
measure2 = list (5:7),
Variable1= "DX",
variable2 = "Proc"
)
My goal is to get my data to look like this. I'm not sure if this is possible or where I'm going wrong.
data.table
melt(as.data.table(spine_pt_3), id.vars = "MRN",
measure.vars = patterns(DX = "^DX", PROC = "^PROC"),
variable.factor = FALSE)
# MRN variable DX PROC
# <char> <char> <char> <char>
# 1: 1 1 12 6030
# 2: 2 1 14 6020
# 3: 3 1 16 6047
# 4: 4 1 m78.2 22585
# 5: 1 2 m46.2 63020
# 6: 2 2 z98.0 63030
# 7: 3 2 z86.711 63047
# 8: 4 2 m10.6 63030
# 9: 1 3 m10.7 22551
# 10: 2 3 Z86.711 22558
# 11: 3 3 M45.1 22528
# 12: 4 3 K21.9 22558
tidyr
tidyr::pivot_longer(spine_pt_3, -MRN,
names_pattern = "(DX|PROC)(.*)",
names_to = c(".value", "codenum"))
# # A tibble: 12 x 4
# MRN codenum DX PROC
# <chr> <chr> <chr> <chr>
# 1 1 1 12 6030
# 2 1 2 m46.2 63020
# 3 1 3 m10.7 22551
# 4 2 1 14 6020
# 5 2 2 z98.0 63030
# 6 2 3 Z86.711 22558
# 7 3 1 16 6047
# 8 3 2 z86.711 63047
# 9 3 3 M45.1 22528
# 10 4 1 m78.2 22585
# 11 4 2 m10.6 63030
# 12 4 3 K21.9 22558
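For comparison, here is a minimal base R sketch of the same reshape using reshape(). It is not part of either answer above. It assumes the procedure codes are stored as character strings, which is a change from the question's numeric vectors: a numeric literal such as 06030 is stored as 6030, so the leading zero is lost. The column name codenum is borrowed from the tidyr example.

```r
# Same toy data as the question, but with PROC codes as strings (assumption)
spine_pt_3 <- data.frame(
  MRN   = c(1, 2, 3, 4),
  DX1   = c("12", "14", "16", "m78.2"),
  DX2   = c("m46.2", "z98.0", "z86.711", "m10.6"),
  DX3   = c("m10.7", "Z86.711", "M45.1", "K21.9"),
  PROC1 = c("06030", "06020", "06047", "22585"),
  PROC2 = c("63020", "63030", "63047", "63030"),
  PROC3 = c("22551", "22558", "22528", "22558"),
  stringsAsFactors = FALSE
)

# Each element of `varying` is one group of wide columns; each group
# collapses into the matching entry of `v.names`
long <- reshape(
  spine_pt_3,
  direction = "long",
  idvar     = "MRN",
  varying   = list(c("DX1", "DX2", "DX3"), c("PROC1", "PROC2", "PROC3")),
  v.names   = c("DX", "PROC"),
  timevar   = "codenum",
  times     = 1:3
)

# Sort by patient, then by code position, and drop the generated row names
long <- long[order(long$MRN, long$codenum), ]
rownames(long) <- NULL
long
```

With the real data, the two vectors inside varying would list DX1 to DX25 and PROC1 to PROC20. Because reshape() requires every group to have the same length, the shorter PROC group would first need padding with NA columns.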
Related
I have a data table (100 rows x 25 cols) that is structured like this:
ColA ColB ColC ColD
1: 1 3 1 2
2: 2 2 1 2
3: 3 1 1 2
I want to add column values together, in every possible combination.
The output would include, for example:
ColA+B ColA+C ColA+D ColB+C ColB+D etc.
BUT! I don't just want pairs. I am trying to get every combination. I also want to see, for example:
ColA+B+C ColA+B+D ColA+C+D ColB+C+D
And:
ColA+B+C+D
Ideally I could simply add all these permutations to the right of the base dataset (I am looking to do a correlation matrix on all these permutations.) I am far from an R expert. I see there are packages like combinat - but they don't seem to get at what I'm after. I would be very grateful indeed for any suggestions.
Thank you.
I'm hesitant to present this as a suggestion: it works with four columns, but as @DanAdams commented, this explodes with 25 columns:
choose(25,2) # 25 columns, 2 each
# [1] 300
choose(25,3) # 25 columns, 3 each
# [1] 2300
### 25 columns, in sets of 2 through 25 at a time
sum(sapply(2:25, choose, n=25))
# [1] 33554406
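As a sanity check on that number, here is a small sketch comparing it to the closed form: all subsets of n columns (2^n) minus the empty set and the n single-column subsets. The memory estimate assumes 100 rows, as in the question, and 8 bytes per numeric value.

```r
n <- 25

# Number of column subsets of size 2 through n
total <- sum(choose(n, 2:n))

# Closed form: every subset, minus the empty set and the n singletons
closed_form <- 2^n - n - 1
total == closed_form

# Rough size of the result if every sum were stored as a double
# (100 rows x `total` columns x 8 bytes), in gigabytes
100 * total * 8 / 1e9
```

That works out to roughly 27 GB for the summed columns alone, before any correlation matrix is computed on them.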
But let's assume that you can limit the combination sizes you need. Change 2:4 to the set sizes you want (here: pairs, triples, and all four columns).
combs <- do.call(c, lapply(2:4, function(z) asplit(combn(names(dat), z), 2)))
names(combs) <- sapply(combs, paste, collapse = "_")
length(combs)
# [1] 11
combs[c(1,2,10,11)]
# $ColA_ColB
# [1] "ColA" "ColB"
# $ColA_ColC
# [1] "ColA" "ColC"
# $ColB_ColC_ColD
# [1] "ColB" "ColC" "ColD"
# $ColA_ColB_ColC_ColD
# [1] "ColA" "ColB" "ColC" "ColD"
ign <- Map(function(cols, nm) dat[, (nm) := rowSums(.SD), .SDcols = cols], combs, names(combs))
dat[]
# ColA ColB ColC ColD ColA_ColB ColA_ColC ColA_ColD ColB_ColC ColB_ColD ColC_ColD ColA_ColB_ColC ColA_ColB_ColD ColA_ColC_ColD ColB_ColC_ColD ColA_ColB_ColC_ColD
# <int> <int> <int> <int> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: 1 3 1 2 4 2 3 4 5 3 5 6 4 6 7
# 2: 2 2 1 2 4 3 4 3 4 3 5 6 5 5 7
# 3: 3 1 1 2 4 4 5 2 3 3 5 6 6 4 7
BTW: I'm inferring that your data is of class data.table, ergo the side-effect I'm using here. If that's not the case, then this is base R:
dat <- cbind(dat, data.frame(lapply(combs, function(cols) rowSums(subset(dat, select = cols)))))
dat
# ColA ColB ColC ColD ColA_ColB ColA_ColC ColA_ColD ColB_ColC ColB_ColD ColC_ColD ColA_ColB_ColC ColA_ColB_ColD ColA_ColC_ColD ColB_ColC_ColD ColA_ColB_ColC_ColD
# 1 1 3 1 2 4 2 3 4 5 3 5 6 4 6 7
# 2 2 2 1 2 4 3 4 3 4 3 5 6 5 5 7
# 3 3 1 1 2 4 4 5 2 3 3 5 6 6 4 7
(Please don't blame me if your R crashes due to memory exhaustion. Save your work often.)
Data
dat <- setDT(structure(list(ColA = 1:3, ColB = 3:1, ColC = c(1L, 1L, 1L), ColD = c(2L, 2L, 2L)), class = c("data.table", "data.frame"), row.names = c(NA, -3L)))
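Since the stated goal was a correlation matrix over the summed columns, here is a hedged base R sketch of that last step. The random data is purely hypothetical: the sample data above has constant ColC and ColD, whose zero variance would produce NA correlations. The set sizes 2:4 are the same assumption as in the answer.

```r
set.seed(1)

# Hypothetical data: 20 rows, 4 numeric columns
dat <- data.frame(ColA = rnorm(20), ColB = rnorm(20),
                  ColC = rnorm(20), ColD = rnorm(20))

# All column-name combinations of size 2, 3 and 4
sizes <- 2:4
combs <- do.call(c, lapply(sizes, function(k) {
  combn(names(dat), k, simplify = FALSE)
}))
names(combs) <- vapply(combs, paste, character(1), collapse = "_")

# One column of row sums per combination (a 20 x 11 matrix)
sums <- vapply(combs,
               function(cols) rowSums(dat[, cols, drop = FALSE]),
               numeric(nrow(dat)))

# Correlations among the original columns and all the sums
cor_mat <- cor(cbind(dat, sums))
dim(cor_mat)
```

Building the sums as a separate matrix keeps the original columns untouched, which makes it easy to rerun with a different sizes vector.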
A common type of data set I come across contains several measurements with associated uncertainties combined in each row. Here's an example:
structure(list(meas1 = c(150.3197, 19.95853, 161.40022, 103.23733, 140.28786, 193.42983, 75.237556, 207.84688, 116.4379, 80.251797 ), unc1 = c(0.038140954, 0.09151666, 0.035390881, 0.043274285, 0.03396304, 0.033362432, 0.05290015, 0.035449262, 0.038330437, 0.049171039), meas2 = c(1270.5522, 562.92518, 940.65152, 696.6982, 380.22449, 1979.0521, 1022.01, 1269.7508, 1686.6116, 1256.0033 ), unc2 = c(0.06063558, 0.061388181, 0.060714985, 0.061178737, 0.061318833, 0.060302475, 0.060876815, 0.060659146, 0.060412551, 0.060635459), meas3 = c(601.11331, 1675.2958, 608.84736, 998.76837, 266.2926, 2933.9751, 1682.3191, 775.43699, 428.29473, 1393.6564 ), unc3 = c(0.103445147, 0.102309634, 0.103147224, 0.101772166, 0.104186185, 0.101292496, 0.101556363, 0.102983978, 0.10394405, 0.101598249), ID = 1:10), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
I want to get it in a tidy configuration, like this:
ID meas_type reading uncert
1 1 meas1 150.31970 0.03814095
2 1 meas2 1270.55220 0.06063558
3 1 meas3 601.11331 0.10344515
4 2 meas1 19.95853 0.09151666
5 2 meas2 562.92518 0.06138818
6 2 meas3 1675.29580 0.10230963 ...
I have a work-around, but am wondering if there isn't a pivot_longer() method that would do this more elegantly.
Here's my klugey solution:
df_vals <- df_raw %>%
pivot_longer(cols = c("meas1", "meas2", "meas3"),
names_to = "meas_type",
values_to = "reading")
df_vals <- df_vals[, 4:6]
df_unc <- df_raw %>%
pivot_longer(cols = starts_with("unc"),
values_to = "uncert")
df_unc <- df_unc[, 4:6]
df <- cbind(df_vals, "uncert" = df_unc$uncert)
We can use the names_pattern argument of pivot_longer.
tidyr::pivot_longer(df, cols = -ID,
names_to = c(".value", "meas_type"),
names_pattern = "(.*)(\\d+)")
# A tibble: 30 x 4
# ID meas_type meas unc
# <int> <chr> <dbl> <dbl>
# 1 1 1 150. 0.0381
# 2 1 2 1271. 0.0606
# 3 1 3 601. 0.103
# 4 2 1 20.0 0.0915
# 5 2 2 563. 0.0614
# 6 2 3 1675. 0.102
# 7 3 1 161. 0.0354
# 8 3 2 941. 0.0607
# 9 3 3 609. 0.103
#10 4 1 103. 0.0433
# … with 20 more rows
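To see what that pattern captures, here is a small base R sketch that applies the same regular expression to the column names with sub(). The first group feeds ".value" (the output column name) and the second feeds meas_type. The name "meas10" is a hypothetical example, not a column in the data.

```r
nms <- c("meas1", "unc1", "meas2", "unc2", "meas3", "unc3")
pattern <- "(.*)(\\d+)"

# First capture group: becomes the output column name via ".value"
value_col <- sub(pattern, "\\1", nms)

# Second capture group: becomes the meas_type value
meas_type <- sub(pattern, "\\2", nms)

data.frame(name = nms, value_col, meas_type)

# Caveat: (.*) is greedy, so a two-digit suffix is split after its first digit
sub(pattern, "\\1", "meas10")

# A stricter pattern keeps the whole number together
sub("(\\D+)(\\d+)", "\\2", "meas10")
```

With only single-digit suffixes, as here, both patterns give the same split. The stricter one is worth considering once there are ten or more measurement columns.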
If you would consider a base R solution, reshape() does what you want, but it needs a data frame rather than a tibble:
d <- as.data.frame(d)
reshape(data=d, varying=1:6,
timevar="meas_type",
direction="long",
sep="")
ID meas_type meas unc
1.1 1 1 150.31970 0.03814095
2.1 2 1 19.95853 0.09151666
3.1 3 1 161.40022 0.03539088
4.1 4 1 103.23733 0.04327429
5.1 5 1 140.28786 0.03396304
6.1 6 1 193.42983 0.03336243
We can use melt from data.table
library(data.table)
melt(setDT(df1), measure = patterns("^unc", "meas"),
value.name = c("unc", "meas"), variable.name = "meas_type")
# ID meas_type unc meas
# 1: 1 1 0.03814095 150.31970
# 2: 2 1 0.09151666 19.95853
# 3: 3 1 0.03539088 161.40022
# 4: 4 1 0.04327429 103.23733
# 5: 5 1 0.03396304 140.28786
# 6: 6 1 0.03336243 193.42983
# 7: 7 1 0.05290015 75.23756
# 8: 8 1 0.03544926 207.84688
# 9: 9 1 0.03833044 116.43790
#10: 10 1 0.04917104 80.25180
#11: 1 2 0.06063558 1270.55220
#...
I am trying to expand an existing dataset, which currently looks like this:
df <- tibble(
site = letters[1:3],
years = rep(4, 3),
tr = c(3, 6, 4)
)
tr is the total number of replicates for each site/year combination. I simply want to add in the replicates and later the response variable for each replicate. This was easy for a single site/year combination using the following function:
f <- function(site=NULL, years=NULL, t=NULL){
df <- tibble(
site = rep(site, each = t, times= years),
tr = rep(1:t, times = years),
year = rep(1:years, each = t)
)
df
}
# For one site:
f(site='a', years=4, t=3)
# Producing this:
# # A tibble: 12 x 3
# site tr year
# <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
# 10 a 1 4
# 11 a 2 4
# 12 a 3 4
How can the function be applied to each row of the input data frame to produce the final data frame? One of the apply functions in base R or pmap_df() from the purrr package would seem ideal, but being unfamiliar with how these functions work, all my efforts have only produced errors.
If we want to apply the same function, use pmap
library(purrr)
pmap_dfr(df, ~ f(..1, ..2, ..3))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Another option is condense() from the development version of dplyr:
library(dplyr)
library(tidyr)
df %>%
group_by(rn = row_number()) %>%
condense(out = f(site, years, tr)) %>%
unnest(c(out))
Or in base R, we can also use do.call with Map
do.call(rbind, do.call(Map, c(f, unname(as.data.frame(df)))))
Well, in base R you could do:
do.call(rbind,do.call(Vectorize(f,SIMPLIFY = FALSE),unname(df)))
# A tibble: 52 x 3
site tr year
* <chr> <int> <int>
1 a 1 1
2 a 2 1
3 a 3 1
4 a 1 2
5 a 2 2
6 a 3 2
7 a 1 3
8 a 2 3
9 a 3 3
10 a 1 4
# ... with 42 more rows
do.call(rbind, lapply(split(df, df$site), function(x){
with(x, data.frame(site,
years = rep(sequence(years), each = tr),
tr = rep(sequence(tr), years)))
}))
We can use Map to apply f to every value of site, years and tr.
do.call(rbind, Map(f, df$site, df$years, df$tr))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
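If the per-row function is not needed for anything else, the same expansion can be sketched without it, using rep() and sequence() on whole columns. This is a different technique from the Map() call above, shown with a plain data.frame instead of a tibble. The helper names n, idx and within_site are made up for the example.

```r
df <- data.frame(site = letters[1:3], years = rep(4, 3), tr = c(3, 6, 4),
                 stringsAsFactors = FALSE)

# Number of output rows per site: years x replicates
n <- df$years * df$tr

# Index of the source row for every output row
idx <- rep(seq_len(nrow(df)), n)

# Running counter 1..n that restarts for each site
within_site <- sequence(n)

out <- data.frame(
  site = df$site[idx],
  # replicate number cycles 1..tr within each year
  tr   = (within_site - 1) %% df$tr[idx] + 1,
  # year increases by one after every tr rows
  year = (within_site - 1) %/% df$tr[idx] + 1
)
nrow(out)
```

The row order matches what f() produces: replicates cycle fastest within each year, and years increase within each site.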
Akrun's answer worked well for me, so I modified it to make the function to be applied to each row of the dataframe a little more explicit:
df1 <- pmap_df(df, function(site, years, tr){
site = rep(site, each = tr, times=years)
year = rep(1:years, each = tr)
tr = rep(1:tr, times=years)
return(tibble(site, year, tr))
})
I want to convert a list of named vectors to a data frame, where some of them are missing columns. I can do that successfully with the deprecated rbind_all, but not with its replacement, bind_rows.
Example
List with missing Columns (el3 missing b)
ex = list(el1=c(a=1, b=2, c=3), el2=c(a=2, b=3, c=4), el3=c(a=3, c=5))
rbind_all(ex)
# A tibble: 3 x 3
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 2 3 4
3 3 NA 5
> bind_rows(ex)
Error in bind_rows_(x, .id) : Argument 3 must be length 3, not 2
Without missing columns
ex2 = list(el1=c(a=1, b=2, c=3), el2=c(a=2, b=3, c=4), el3=c(a=3, b=4, c=5))
rbind_all(ex2)
# A tibble: 3 x 3
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 2 3 4
3 3 4 5
bind_rows(ex2) # Output is transposed for some reason
# A tibble: 3 x 3
el1 el2 el3
<dbl> <dbl> <dbl>
1 1 2 3
2 2 3 4
3 3 4 5
How to replicate rbind_all behavior with a non-deprecated function?
Please read this example in ?bind_rows:
# Note that for historical reasons, lists containing vectors are
# always treated as data frames. Thus their vectors are treated as
# columns rather than rows, and their inner names are ignored:
ll <- list(
a = c(A = 1, B = 2),
b = c(A = 3, B = 4)
)
bind_rows(ll)
# You can circumvent that behaviour with explicit splicing:
bind_rows(!!!ll)
Therefore, in your case, you can try:
ex = list(el1=c(a=1, b=2, c=3), el2=c(a=2, b=3, c=4), el3=c(a=3, c=5))
bind_rows(!!!ex)
# # A tibble: 3 x 3
# a b c
# <dbl> <dbl> <dbl>
# 1 1 2 3
# 2 2 3 4
# 3 3 NA 5
ex2 = list(el1=c(a=1, b=2, c=3), el2=c(a=2, b=3, c=4), el3=c(a=3, b=4, c=5))
bind_rows(!!!ex2)
# # A tibble: 3 x 3
# a b c
# <dbl> <dbl> <dbl>
# 1 1 2 3
# 2 2 3 4
# 3 3 4 5
Here is a workaround using map_dfr from the purrr package.
library(dplyr)
library(purrr)
map_dfr(ex, ~as_tibble(t(.)))
# # A tibble: 3 x 3
# a b c
# <dbl> <dbl> <dbl>
# 1 1 2 3
# 2 2 3 4
# 3 3 NA 5
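For completeness, a base R sketch of the same idea without dplyr or purrr: index every vector by the union of all names, so that a missing name yields NA, then bind the results as rows. This is an alternative technique rather than what bind_rows does internally, and all_names and rows are names made up for the example.

```r
ex <- list(el1 = c(a = 1, b = 2, c = 3),
           el2 = c(a = 2, b = 3, c = 4),
           el3 = c(a = 3, c = 5))

# Union of all element names, in order of first appearance
all_names <- unique(unlist(lapply(ex, names)))

# Indexing a named vector by a name it lacks returns NA
rows <- lapply(ex, function(v) v[all_names])

# Stack the vectors as rows and label the columns
out <- as.data.frame(do.call(rbind, rows))
names(out) <- all_names
out
```

The list names (el1, el2, el3) survive as row names here, whereas the tibble-based versions above drop them.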
I need to add a new row to each id group where key = "n" and value is total - (a + b).
x <- data_frame( id = c(1,1,1,2,2,2,2),
key = c("a","b","total","a","x","b","total"),
value = c(1,2,10,4,1,3,12) )
# A tibble: 7 × 3
id key value
<dbl> <chr> <dbl>
1 1 a 1
2 1 b 2
3 1 total 10
4 2 a 4
5 2 x 1
6 2 b 3
7 2 total 12
In this example, the new rows should be
1 n 7
2 n 5
I tried getting the a+b subtotal and joining that to the total count to get the difference, but after using nine dplyr verbs I seem to be going in the wrong direction. Thanks.
This isn't a join, it's just binding new rows on:
x %>% group_by(id) %>%
summarize(
value = sum(value[key == 'total']) - sum(value[key %in% c('a', 'b')]),
key = 'n'
) %>%
bind_rows(x) %>%
select(id, key, value) %>% # back to original column order
arrange(id, key) # and a standard row order
# # A tibble: 9 × 3
# id key value
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 b 2
# 3 1 n 7
# 4 1 total 10
# 5 2 a 4
# 6 2 b 3
# 7 2 n 5
# 8 2 total 12
# 9 2 x 1
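The same "compute one row per group, then bind" logic can be sketched in base R, assuming x is a plain data.frame with the question's values. The names n_rows and g are made up for the example.

```r
x <- data.frame(id    = c(1, 1, 1, 2, 2, 2, 2),
                key   = c("a", "b", "total", "a", "x", "b", "total"),
                value = c(1, 2, 10, 4, 1, 3, 12),
                stringsAsFactors = FALSE)

# One new row per id: total minus the sum of a and b
n_rows <- do.call(rbind, lapply(split(x, x$id), function(g) {
  data.frame(
    id    = g$id[1],
    key   = "n",
    value = sum(g$value[g$key == "total"]) -
            sum(g$value[g$key %in% c("a", "b")])
  )
}))

# Bind the new rows onto the original data and sort by id, then key
out <- rbind(x, n_rows)
out <- out[order(out$id, out$key), ]
rownames(out) <- NULL
out
```

For id 1 this gives 10 - (1 + 2) = 7, and for id 2 it gives 12 - (4 + 3) = 5; the x row in group 2 is ignored, as the question requires.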
Here's a way using data.table, binding rows as in Gregor's answer:
library(data.table)
setDT(x)
dcast(x, id ~ key)[, .(id, key = "n", value = total - a - b)][, rbind(.SD, x)][order(id)]
id key value
1: 1 n 7
2: 1 a 1
3: 1 b 2
4: 1 total 10
5: 2 n 5
6: 2 a 4
7: 2 x 1
8: 2 b 3
9: 2 total 12