New data.table columns based on grouping and function of multiple columns - r

Let's say I have a data.frame
sample_df = structure(list(AE = c(148, 1789, 1223, 260, 1825, 37, 1442, 484,
10, 163, 1834, 254, 445, 837, 721, 1904, 1261, 382, 139, 213),
FW = structure(c(1L, 3L, 2L, 3L, 3L, 1L, 2L, 3L, 2L, 2L,
3L, 2L, 3L, 2L, 1L, 3L, 1L, 1L, 1L, 3L), .Label = c("LYLR",
"OCXG", "BIYX"), class = "factor"), CP = c("WYB/NXO", "HUK/NXO",
"HUK/WYB", "HUK/NXO", "WYB/NXO", "HUK/WYB", "HUK/NXO", "HUK/NXO",
"WYB/NXO", "HUK/NXO", "WYB/NXO", "HUK/NXO", "HUK/WYB", "WYB/NXO",
"HUK/WYB", "WYB/NXO", "WYB/NXO", "HUK/WYB", "WYB/NXO", "WYB/NXO"
), SD = c(1, 1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1, 1, -1,
-1, 1, -1, 1, 1, 1)), .Names = c("AE", "FW", "CP", "SD"), row.names = c(NA, -20L), class = "data.frame")
Or in human-readable format:
AE FW CP SD
1 148 LYLR WYB/NXO 1
2 1789 BIYX HUK/NXO 1
3 1223 OCXG HUK/WYB -1
4 260 BIYX HUK/NXO 1
5 1825 BIYX WYB/NXO 1
6 37 LYLR HUK/WYB 1
7 1442 OCXG HUK/NXO 1
8 484 BIYX HUK/NXO -1
9 10 OCXG WYB/NXO 1
10 163 OCXG HUK/NXO 1
11 1834 BIYX WYB/NXO -1
12 254 OCXG HUK/NXO -1
13 445 BIYX HUK/WYB 1
14 837 OCXG WYB/NXO -1
15 721 LYLR HUK/WYB -1
16 1904 BIYX WYB/NXO 1
17 1261 LYLR WYB/NXO -1
18 382 LYLR HUK/WYB 1
19 139 LYLR WYB/NXO 1
20 213 BIYX WYB/NXO 1
Now suppose that, for each unique value (fw, cp) of (FW, CP), I would like to get:
the sum of all values of AE for (FW, CP) = (fw, cp)
the mean of all values of SD for (FW, CP) = (fw, cp)
In R, one could do something like:
unique_keys <- unique(sample_df[,c('FW','CP')])
slow_version <- function(ind, sample_df, unique_keys){
  index <- which(sample_df$FW == unique_keys$FW[ind] & sample_df$CP == unique_keys$CP[ind])
  c(ind = ind,
    sum_ae = sum(sample_df$AE[index]),
    mean_sd = mean(sample_df$SD[index]))
}
intermed_result <- t(sapply(1:nrow(unique_keys), slow_version,
                            sample_df = sample_df,
                            unique_keys = unique_keys))
colnames(intermed_result) <- c('ind', 'sum', 'mean')
result <- data.frame(unique_keys[intermed_result[, 'ind'], ],
                     'sum' = intermed_result[, 'sum'],
                     'mean' = intermed_result[, 'mean'])
but this gets pretty slow as the size of sample_df grows.
Thanks to this answer, I suspect it is possible to use data.table magic to get the same result much faster. But doing:
library(data.table)
sample_dt = data.table(sample_df)
setkey(sample_dt, FW, CP)
f <- function(AE, SD) {list('sum' = sum(AE), 'mean' = mean(SD))}
sample_dt[,c("col1","col2"):=f(AE, SD), by=.(FW, CP)][]
does not yield the desired result. What is the correct way?

I would try:
library(data.table)
sample_dt = data.table(sample_df)
setkey(sample_dt, FW, CP)
f <- function(AE, SD) {list('sum' = sum(AE), 'mean' = mean(SD))}
sample_dt[, f(AE, SD), by=.(FW, CP)]
# FW CP sum mean
# 1: LYLR HUK/WYB 1140 0.3333333
# 2: LYLR WYB/NXO 1548 0.3333333
# 3: OCXG HUK/NXO 1859 0.3333333
# 4: OCXG HUK/WYB 1223 -1.0000000
# 5: OCXG WYB/NXO 847 0.0000000
# 6: BIYX HUK/NXO 2533 0.3333333
# 7: BIYX HUK/WYB 445 1.0000000
# 8: BIYX WYB/NXO 5776 0.5000000
You didn't get the desired output because you assigned the resulting sum and mean columns by group to the original data.table with :=. However, I also prefer the syntax suggested by Frank, which should be the right way to go. For the current named-list approach, adding verbose = T says:
Making each group and running j (GForce FALSE) ... The result of j is
a named list. It's very inefficient to create the same names over and
over again for each group. When j=list(...), any names are detected,
removed and put back after grouping has completed, for efficiency.
Using j=transform(), for example, prevents that speedup (consider
changing to :=). This message may be upgraded to warning in future.
When we have many groups and the functions in j are basic functions like mean and sum, using
sample_dt[, .(sum.AE = sum(AE), mean.SD = mean(SD)), by=.(FW, CP)]
would be very fast, because those functions are internally replaced with GForce functions like gmean. See ?GForce and Frank's benchmark for more information.
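To confirm the optimisation is applied, you can pass verbose = TRUE to the grouping call; data.table then reports whether GForce was used. A quick diagnostic, using sample_dt from above:
# data.table prints "GForce optimized j to ..." when sum/mean are swapped for gsum/gmean
sample_dt[, .(sum.AE = sum(AE), mean.SD = mean(SD)), by = .(FW, CP), verbose = TRUE]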

Related

How to join two dataframes containing time varying variables in R

This seems like a simple enough thing but I can't figure it out nor find an answer online - apologies if it is something obvious. I have two separate dataframes containing the same patients with the same unique identifier. Both datasets have time-varying variables - one continuous and one categorical - and the time to each reading differs between the sets, but both share a common start point at time 1. I have tried to modify the tmerge function from the survival package, but without luck, as I don't have a dichotomous outcome variable nor a single dataset with one row per patient.
A reprex for creating the datasets (df1 and df2) is below, along with an example of my desired combined output table for a single patient (ID 3); the output gets very long if done for all 4 patients.
Thanks for any possible help
df1 <- structure(list(tstart = c(1, 1, 1, 1426, 1, 560, 567), tstop = c(2049,
3426, 1426, 1707, 560, 567, 4207), category = structure(c(1L,
1L, 1L, 2L, 1L, 4L, 2L), .Label = c("none", "high", "low", "moderate"
), class = "factor"), id = c(1L, 2L, 3L, 3L, 4L, 4L, 4L)), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
df2 <- structure(list(tstart = c(1, 365, 730, 1, 365, 730, 1096, 2557,
1, 365, 730, 1096, 1826, 2557, 3652, 1), tstop = c(365, 730,
1096, 365, 730, 1096, 2557, 2582, 365, 730, 1096, 1826, 2557,
3652, 4864, 365), egfr = c(66, 62, 58, 54, 50, 43, 49, 51, 106,
103, 80, 92, 97, 90, 81, 51), id = c(1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 4L)), row.names = c(NA, -16L), class = c("tbl_df",
"tbl", "data.frame"))
df_example_patient_3 <- structure(list(id = c(3L, 3L, 3L,
3L, 3L, 3L,3L, 3L, 3L), tstart = c(1, 365, 730, 1096, 1426, 1707, 1826, 2557, 3652), tstop = c(365, 730,
1096, 1426, 1707, 1826, 2557, 3652, 4864), egfr = c(106, 103, 80, 92, 92, 92, 97, 90, 81), category = c("none", "none", "none", "none", "high", "high", "high", "high", "high")), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"))
# DF1
tstart tstop category id
<dbl> <dbl> <fct> <int>
1 1 2049 none 1
2 1 3426 none 2
3 1 1426 none 3
4 1426 1707 high 3
5 1 560 none 4
6 560 567 moderate 4
7 567 4207 high 4
# DF2
tstart tstop egfr id
<dbl> <dbl> <dbl> <int>
1 1 365 66 1
2 365 730 62 1
3 730 1096 58 1
4 1 365 54 2
5 365 730 50 2
6 730 1096 43 2
7 1096 2557 49 2
8 2557 2582 51 2
9 1 365 106 3
10 365 730 103 3
11 730 1096 80 3
12 1096 1826 92 3
13 1826 2557 97 3
14 2557 3652 90 3
15 3652 4864 81 3
16 1 365 51 4
# Combined set
id tstart tstop egfr category
<int> <dbl> <dbl> <dbl> <chr>
1 3 1 365 106 none
2 3 365 730 103 none
3 3 730 1096 80 none
4 3 1096 1426 92 none
5 3 1426 1707 92 high
6 3 1707 1826 92 high
7 3 1826 2557 97 high
8 3 2557 3652 90 high
9 3 3652 4864 81 high
I had to do it this way to really work out the details.
First, I construct a full df1 with all the timestamps, including those of df2.
Then I proceed with multiple merges. This is not elegant, but it works:
library(data.table)
library(zoo)
# Proper data.tables
setDT(df1, key = c("id", "tstart"))
setDT(df2, key = c("id", "tstart"))
timestamps_by_id <- unique(rbind(
  df1[, .(id, tstart)],
  df1[, .(id, tstop)],
  df2[, .(id, tstart)],
  df2[, .(id, tstop)],
  use.names = FALSE
))
setorder(timestamps_by_id, id, tstart)
# Merge to construct full df1
df1_full <- df1[timestamps_by_id]
df1_full[, category := na.locf(category), by = id]
df1_full[, tstop := shift(tstart, -1), by = id]
setkey(df1_full, id, tstart)
# Merge with df2
result <- na.omit(df2[df1_full, roll = TRUE])
result[, tstop := i.tstop]
print(result[id == 3, .(id, tstart, tstop, egfr, category)])
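As a quick sanity check, the print above can be compared against the expected table from the question's reprex:
# Expected rows for patient 3, as given in the question
df_example_patient_3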
Or a more data.table-idiomatic solution using the more arcane foverlaps:
library(data.table)
# Proper data.tables
setDT(df1, key = c("id", "tstart", "tstop"))
setDT(df2, key = c("id", "tstart", "tstop"))
# We add an infinite upper range
proper_df1 <- rbind(
  df1,
  df1[, .SD[which.max(tstop)], by = .(id)][, .(id, tstart = tstop, tstop = Inf, category)]
)
setkey(proper_df1, id, tstart, tstop)
overlaps <- foverlaps(df2, proper_df1, type = "any") # Overlap join
overlaps[
  tstart %between% .(i.tstart, i.tstop) & tstart != 1,
  i.tstart := tstart
]
overlaps[tstop %between% .(i.tstart, i.tstop), i.tstop := tstop]
print(overlaps[
  id == 3,
  .(id, "tstart" = i.tstart, "tstop" = i.tstop, category, egfr)
])
This messy dplyr solution seems to work for this particular dataset, but I don't know whether it would work for all datasets; the direction of the fill may need to be altered depending on the particular dataset.
library(tidyverse)
library(magrittr)
df1 %>%
  bind_rows(df2) %>%
  group_by(id) %>%
  arrange(id, tstop) %>%
  mutate(
    tstart = case_when(
      tstart < lag(tstop) ~ lag(tstop),
      TRUE ~ tstart)) %>%
  fill(egfr, category, .direction = "updown") %>%
  ungroup() %>%
  filter(id == 3)
tstart tstop category id egfr
<dbl> <dbl> <fct> <int> <dbl>
1 1 365 none 3 106
2 365 730 none 3 103
3 730 1096 none 3 80
4 1096 1426 none 3 92
5 1426 1707 high 3 92
6 1707 1826 high 3 92
7 1826 2557 high 3 97
8 2557 3652 high 3 90
9 3652 4864 high 3 81

Filter data frame based on numeric vector with "tolerance"

I would like to filter a data frame using a numeric vector, applying the function below:
test_data <- exp_data[exp_data$Size_Change %in% vec_data,]
This is what the example data looks like:
dput(exp_data)
structure(list(Name = c("Mark", "Greg", "Tomas", "Morka", "Pekka",
"Robert", "Tim", "Tom", "Bobby", "Terka"), Mode = c(1, 2, NA,
4, NA, 3, NA, 1, NA, 3), Change = structure(c(6L, 2L, 4L, 5L,
7L, 7L, 7L, 8L, 3L, 1L), .Label = c("D[+58], I[+12][+385]", "C[+58], K[+1206]",
"C[+58], P[+2074]", "C[+58], K[+2172]", "C[+58], K[+259]", "C[+58], K[+2665]",
"C[+58], T[+385]", "C[+58], C[+600]"), class = "factor"), Size = c(1335.261,
697.356, 1251.603, 920.43, 492.236, 393.991, 492.239, 727.696,
1218.933, 495.237), Place = c(3L, 4L, 3L, 2L, 4L, 5L, 4L, 3L,
3L, 4L), Size_Change = c(4004, 2786, 3753, 1840, 1966, 1966,
1966, 2181, 3655, 1978)), row.names = 2049:2058, class = "data.frame")
and vector used for filtering:
dput(vec_data)
c(4003, 2785, 954, 1129, 4013, 756, 1852, 2424, 1954, 246, 147,
234, 562, 1617, 2180, 888, 1176)
I mention the tolerance because vec_data is not very precise: I expect a +1/-1 difference in the numbers, so after applying the function above it will not keep rows with such a difference. The difference may also be +12/-12 or +24/-24. Can I somehow take this into account while filtering?
Of course, one solution would be to build variants like (vec_data + 1), (vec_data - 1), (vec_data + 12), etc., do a couple of filtering attempts, and finally rbind all the outputs, but I am looking for a more elegant way. It would also be great to add a column indicating how each row was matched: whether it was an exact number from vec_data or one modified by +1, +12, -24 or whatever. Please take into account that a combination of +1/-1 with any other modification is also possible. The additional column is not necessary if it makes things too complicated.
One option could be (tolerance = 1; using exp_data and vec_data from the question):
library(dplyr)
exp_data %>%
  filter(sapply(Size_Change, function(x) any(abs(x - vec_data) %in% 0:1)))
Name Mode Change Size Place Size_Change
1 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2 Greg 2 C[+58], K[+1206] 697.356 4 2786
3 Tom 1 C[+58], C[+600] 727.696 3 2181
Tolerance = 14:
exp_data %>%
  filter(sapply(Size_Change, function(x) any(abs(x - vec_data) %in% 0:14)))
Name Mode Change Size Place Size_Change
1 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2 Greg 2 C[+58], K[+1206] 697.356 4 2786
3 Morka 4 C[+58], K[+259] 920.430 2 1840
4 Pekka NA C[+58], T[+385] 492.236 4 1966
5 Robert 3 C[+58], T[+385] 393.991 5 1966
6 Tim NA C[+58], T[+385] 492.239 4 1966
7 Tom 1 C[+58], C[+600] 727.696 3 2181
The same logic with rowwise():
exp_data %>%
  rowwise() %>%
  filter(any(abs(Size_Change - vec_data) %in% 0:1))
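The question also asks for a column showing how each row matched. A minimal sketch (the matched_offset column name is illustrative, not part of the answers above) records the signed offset to the closest vec_data value:
tolerance <- 1
# Signed difference to the nearest vec_data value; NA when nothing is within tolerance
exp_data$matched_offset <- sapply(exp_data$Size_Change, function(x) {
  d <- x - vec_data
  if (min(abs(d)) <= tolerance) d[which.min(abs(d))] else NA_real_
})
exp_data[!is.na(exp_data$matched_offset), ]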
The most obvious methodology is to filter based on inequality rather than exact matches (always recommended when comparing numerics [not integers]):
comp <- function(x, yvec, tolerance = 1){
  sapply(x, \(xi){any(abs(xi - yvec) <= tolerance)})
}
exp_data[comp(exp_data$Size_Change, vec_data),]
Name Mode Change Size Place Size_Change
2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
2056 Tom 1 C[+58], C[+600] 727.696 3 2181
# Tolerance = 2
# exp_data[comp(exp_data$Size_Change, vec_data, 2),]
What about using a tolerance function.
tol <- \(x, tol=1L) sapply(seq(-tol, tol, 1L), \(i) sweep(as.matrix(x), 1L, i))
exp_data[exp_data$Size_Change %in% tol(vec_data), ]
# Name Mode Change Size Place Size_Change
# 2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
# 2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
# 2056 Tom 1 C[+58], C[+600] 727.696 3 2181
It defaults to a tolerance of ±1; if we want ±24 we may pass it in the argument:
exp_data[exp_data$Size_Change %in% tol(vec_data, 24L), ]
# Name Mode Change Size Place Size_Change
# 2049 Mark 1 C[+58], K[+2665] 1335.261 3 4004
# 2050 Greg 2 C[+58], K[+1206] 697.356 4 2786
# 2052 Morka 4 C[+58], K[+259] 920.430 2 1840
# 2053 Pekka NA C[+58], T[+385] 492.236 4 1966
# 2054 Robert 3 C[+58], T[+385] 393.991 5 1966
# 2055 Tim NA C[+58], T[+385] 492.239 4 1966
# 2056 Tom 1 C[+58], C[+600] 727.696 3 2181
# 2058 Terka 3 D[+58], I[+12][+385] 495.237 4 1978
If you are wondering about the L in 24L, it is integer notation; you may also use tol=24 without any problems.
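To see what tol() builds internally: it returns a matrix with one column per offset, and %in% treats that matrix as a flat vector of all acceptable values.
dim(tol(vec_data))       # 17 x 3: vec_data shifted by -1, 0, +1
dim(tol(vec_data, 24L))  # 17 x 49: offsets -24, ..., +24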
Note: R version 4.1.2 (2021-11-01)

Complex dataframe values selection based on both rows and columns

I need to select some values on each row of the dataset below and compute a sum.
This is a part of my dataset.
> prova
key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP
18 3483 364 3509 b n m
19 2367 818 3924 b n m
20 3775 1591 802 b m n
21 929 3059 744 n b n
22 3732 530 1769 b n m
23 3503 2011 2932 b n b
24 3684 1424 1688 b n m
Rows are trials of the experiment and columns are, in temporal sequence, the keys pressed (KeyPressRESP) and the time from each key press until the next one (key_duration).
So for example in the first trial (first row) I pressed "b" and after 3483 ms I pressed "n" and so on.
This is my dataframe
structure(list(key_duration1 = c(3483L, 2367L, 3775L, 929L, 3732L,
3503L, 3684L), key_duration2 = c(364L, 818L, 1591L, 3059L, 530L,
2011L, 1424L), key_duration3 = c(3509, 3924, 802, 744, 1769,
2932, 1688), KeyPress1RESP = structure(c(2L, 2L, 2L, 4L, 2L,
2L, 2L), .Label = c("", "b", "m", "n"), class = "factor"), KeyPress2RESP = structure(c(4L,
4L, 3L, 2L, 4L, 4L, 4L), .Label = c("", "b", "m", "n"), class = "factor"),
KeyPress3RESP = structure(c(3L, 3L, 4L, 4L, 3L, 2L, 3L), .Label = c("",
"b", "m", "n"), class = "factor")), row.names = 18:24, class = "data.frame")
I need a method to select, in each row (trial), all "b" values, compute the sum of their key_duration values, and write the result in a new column; the same for "m".
How can I do this?
I think I need a function similar to apply(), but one that does not use every value in the row, only the selected values.
apply(prova[,1:3],1,sum)
Thanks
Here is a way using data.table.
library(data.table)
setDT(prova)
# melt
prova_long <-
  melt(
    prova[, idx := 1:.N],
    id.vars = "idx",
    measure.vars = patterns("^key_duration", "^KeyPress"),
    variable.name = "key",
    value.name = c("duration", "RESP")
  )
# aggregate
prova_aggr <- prova_long[RESP != "n", .(duration_sum = sum(duration)), by = .(idx, RESP)]
# spread and join
prova[dcast(prova_aggr, idx ~ paste0("sum_", RESP)), c("sum_b", "sum_m") := .(sum_b, sum_m), on = "idx"]
prova
Result
# key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP idx sum_b sum_m
#1: 3483 364 3509 b n m 1 3483 3509
#2: 2367 818 3924 b n m 2 2367 3924
#3: 3775 1591 802 b m n 3 3775 1591
#4: 929 3059 744 n b n 4 3059 NA
#5: 3732 530 1769 b n m 5 3732 1769
#6: 3503 2011 2932 b n b 6 6435 NA
#7: 3684 1424 1688 b n m 7 3684 1688
The idea is to reshape your data to long format and aggregate by "RESP" per row, then spread the result and join it back to your initial data.
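For intuition, the long table has one row per key press, so it contains nrow(prova) * 3 = 21 rows; inspecting a single trial shows the paired duration/RESP columns that patterns() created:
prova_long[idx == 1]  # three rows: one per key pressed in trial 1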
With tidyverse you can do:
bind_cols(
  df %>%
    select_at(vars(starts_with("KeyPress"))) %>%
    rowid_to_column() %>%
    gather(var, val, -rowid),
  df %>%
    select_at(vars(starts_with("key_"))) %>%
    rowid_to_column() %>%
    gather(var, val, -rowid)
) %>%
  group_by(rowid) %>%
  summarise(b_values = sum(val1[val == "b"]),
            m_values = sum(val1[val == "m"])) %>%
  left_join(df %>% rowid_to_column(), by = c("rowid" = "rowid")) %>%
  ungroup() %>%
  select(-rowid)
b_values m_values key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP
<dbl> <dbl> <int> <int> <dbl> <fct> <fct> <fct>
1 3483. 3509. 3483 364 3509. b n m
2 2367. 3924. 2367 818 3924. b n m
3 3775. 1591. 3775 1591 802. b m n
4 3059. 0. 929 3059 744. n b n
5 3732. 1769. 3732 530 1769. b n m
6 6435. 0. 3503 2011 2932. b n b
7 3684. 1688. 3684 1424 1688. b n m
First, it splits the df into two: one with variables starting with "KeyPress" and one with variables starting with "key_". Second, it transforms the two dfs from wide to long format and combines them by columns. Third, it creates a summary for the "b" and "m" values according to row ID. Finally, it merges the results with the original df.
You can make a logical matrix from the KeyPress columns, multiply it by the key_duration subset and then take their rowSums.
prova$b_values <- rowSums((prova[, 4:6] == "b") * prova[, 1:3])
prova$n_values <- rowSums((prova[, 4:6] == "n") * prova[, 1:3])
key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP b_values n_values
18 3483 364 3509 b n m 3483 364
19 2367 818 3924 b n m 2367 818
20 3775 1591 802 b m n 3775 802
21 929 3059 744 n b n 3059 1673
22 3732 530 1769 b n m 3732 530
23 3503 2011 2932 b n b 6435 2011
24 3684 1424 1688 b n m 3684 1424
It works because the logical values are coerced to numeric 1s or 0s, and only the values for individual keys are retained.
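A minimal illustration on the first row (assuming prova is still the plain data.frame from the question):
prova[1, 4:6] == "b"                    # TRUE FALSE FALSE
(prova[1, 4:6] == "b") * prova[1, 1:3]  # 3483 0 0: only the "b" duration survives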
Extra: to generalise, you could instead use a function and tidyverse/purrr to map it:
get_sums <- function(key) rowSums((prova[, 4:6] == key) * prova[, 1:3])
keylist <- list(b_values = "b", n_values = "n", m_values = "m")
library(tidyverse)
bind_cols(prova, map_dfc(keylist, get_sums))  # map_dfc binds one new column per key

Subsetting rows based on multiple columns using data.table - fastest way

I was wondering if there was a more elegant, less clunky and faster way to do this. I have millions of rows with ICD coding for clinical data; a short example is provided below. I want to subset the dataset based on any of the columns meeting a specific set of diagnosis codes. The code below works but takes ages in R, and I was wondering if there is a faster way.
dat <- structure(list(eid = 1:10, mc1 = structure(c(4L, 3L, 5L, 2L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("345", "410", "413.9", "I20.1",
"I23.4"), class = "factor"), oc1 = c(350, 323, 12, 35, 413.1,
345, 345, 345, 345, 345), oc2 = structure(c(5L, 6L, 4L, 1L, 1L,
2L, 2L, 2L, 3L, 2L), .Label = c("", "345", "I20.3", "J23.6",
"K50.1", "K51.4"), class = "factor")), .Names = c("eid", "mc1",
"oc1", "oc2"), class = c("data.table", "data.frame"), row.names = c(NA,
-10L))
The code below subsets all rows where any column matches either "I20" or "413" (this would include all codes that have, for example, been coded as "I20.4" or "413.9", etc.):
dat2 <- dat [substr(dat$mc1,1,3)== "413"|
substr(dat$oc1,1,3)== "413"|
substr(dat$oc2,1,3)== "413"|
substr(dat$mc1,1,3)== "I20"|
substr(dat$oc1,1,3)== "I20"|
substr(dat$oc2,1,3)== "I20"]
Is there a faster way to do this? For example, can I loop through each of the columns looking for the specific codes "I20" or "413" and subset those rows?
We can specify the columns of interest in .SDcols, loop through the Subset of Data.table (.SD), get the first 3 characters with substr, check whether they are %chin% (a fast character-only %in%) a vector of values, and Reduce the result to a single logical vector for subsetting the rows:
dat[dat[, Reduce(`|`, lapply(.SD, function(x)
  substr(x, 1, 3) %chin% c('413', 'I20'))), .SDcols = 2:4]]
# eid mc1 oc1 oc2
#1: 1 I20.1 350.0 K50.1
#2: 2 413.9 323.0 K51.4
#3: 5 345 413.1
#4: 9 345 345.0 I20.3
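Reduce(`|`, ...) collapses the per-column logical vectors into a single row mask; a tiny standalone illustration:
# A row is kept if any column matched
Reduce(`|`, list(c(TRUE, FALSE, FALSE), c(FALSE, TRUE, FALSE)))
# [1]  TRUE  TRUE FALSE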
For larger data it could help if we don't check all rows:
minem <- function(dt, colsID = 2:4) {
  cols <- colnames(dt)[colsID]
  x <- c('413', 'I20')
  set(dt, j = "inn", value = FALSE)
  for (i in cols) {
    dt[inn == FALSE, inn := substr(get(i), 1, 3) %chin% x]
  }
  dt[inn == TRUE][, inn := NULL][]
}
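The benchmark below also calls an akrun() helper that is not defined in this excerpt; presumably it just wraps the first answer's approach, along the lines of this sketch:
# Hypothetical wrapper matching the akrun(dt, colsID) calls below;
# it applies the Reduce/%chin% filter from the answer above
akrun <- function(dt, colsID = 2:4) {
  dt[dt[, Reduce(`|`, lapply(.SD, function(x)
    substr(x, 1, 3) %chin% c('413', 'I20'))), .SDcols = colsID]]
}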
n <- 1e7
set.seed(13)
dts <- dat  # assuming dts is the 10-row example data.table from the question
dt <- dts[sample(.N, n, replace = TRUE)]
dt <- cbind(dt, dts[sample(.N, n, replace = TRUE), 2:4])
setnames(dt, make.names(colnames(dt), unique = TRUE))
dt
dt
# eid mc1 oc1 oc2 mc1.1 oc1.1 oc2.1
# 1: 8 345 345.0 345 345 345 345
# 2: 3 I23.4 12.0 J23.6 413.9 323 K51.4
# 3: 4 410 35.0 413.9 323 K51.4
# 4: 1 I20.1 350.0 K50.1 I23.4 12 J23.6
# 5: 10 345 345.0 345 345 345 345
# ---
# 9999996: 3 I23.4 12.0 J23.6 I20.1 350 K50.1
# 9999997: 5 345 413.1 I20.1 350 K50.1
# 9999998: 4 410 35.0 345 345 345
# 9999999: 4 410 35.0 410 35
# 10000000: 10 345 345.0 345 345 345 I20.3
system.time(r1 <- akrun(dt, 2:ncol(dt))) # 22.88 sec
system.time(r2 <- minem(dt, 2:ncol(dt))) # 17.72 sec
all.equal(r1, r2)
# [1] TRUE

grouping, comparing, and counting rows in r

I have a data frame that looks as the following:
system Id initial final
665 9 16001 6070 6071
683 10 16001 6100 6101
696 11 16001 6101 6113
712 10 16971 6150 6151
715 11 16971 6151 6163
4966 7 4118 10238 10242
5031 9 4118 10260 10278
5088 10 4118 10279 10304
5115 11 4118 10305 10317
structure(list(system = c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L,
11L), Id = c(16001L, 16001L, 16001L, 16971L, 16971L, 4118L, 4118L,
4118L, 4118L), initial = c(6070, 6100, 6101, 6150, 6151, 10238,
10260, 10279, 10305), final = c(6071, 6101, 6113, 6151, 6163,
10242, 10278, 10304, 10317)), .Names = c("system", "Id", "initial",
"final"), row.names = c(665L, 683L, 696L, 712L, 715L, 4966L,
5031L, 5088L, 5115L), class = "data.frame")
I would like to get a new data frame with the following structure:
Id system length initial final
1 16001 9,10,11 3 6070 6113
2 16971 10,11 2 6150 6163
3 4118 7 1 10238 10242
4 4118 9,10,11 3 10260 10317
structure(list(Id = c(16001L, 16971L, 4118L, 4118L), system = structure(c(3L,
1L, 2L, 3L), .Label = c("10,11", "7", "9,10,11"), class = "factor"),
length = c(3L, 2L, 1L, 3L), initial = c(6070L, 6150L, 10238L,
10260L), final = c(6113, 6163, 10242, 10317)), .Names = c("Id",
"system", "length", "initial", "final"), class = "data.frame", row.names = c(NA,
-4L))
The grouping is by Id and by runs where the difference between consecutive values of the "system" field equals one. I would also like to get the distinct "system" values and how many of them are involved in each group. Finally, I need a column with the first "initial" and the last "final" of each group as well.
Is it possible to do that in R?
Thanks.
You could use data.table. Convert the "data.frame" to a "data.table" (setDT), create a grouping variable "indx" by taking the difference of adjacent elements of "system" (diff(system)) and applying cumsum to the logical vector, then use "Id" and "indx" as grouping variables to get the statistics:
library(data.table)
setDT(df)[, list(system = toString(system), length = .N, initial = initial[1L],
                 final = final[.N]),
          by = list(Id, indx = cumsum(c(TRUE, diff(system) != 1)))][, indx := NULL][]
# Id system length initial final
#1: 16001 9, 10, 11 3 6070 6113
#2: 16971 10, 11 2 6150 6163
#3: 4118 7 1 10238 10242
#4: 4118 9, 10, 11 3 10260 10317
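To see how the run index in by works, here is the grouping vector evaluated on its own for the example data:
# cumsum starts a new index whenever the step in `system` is not +1
with(df, cumsum(c(TRUE, diff(system) != 1)))
# [1] 1 1 1 2 2 3 4 4 4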
Or based on #jazzurro's comment about using first/last functions from dplyr,
library(dplyr)
df %>%
  group_by(indx = cumsum(c(TRUE, diff(system) != 1)), Id) %>%
  summarise(system = toString(system), length = n(),
            initial = first(initial), final = last(final))
A solution without data.table, using plyr:
library(plyr)
func = function(subdf)
{
  bool = c(diff(subdf$system), 1) == 1
  ldply(split(subdf, bool), function(u){
    data.frame(system = paste(u$system, collapse=','),
               Id = unique(u$Id),
               length = nrow(u),
               initial = head(u, 1)$initial,
               final = tail(u, 1)$final)
  })
}
ldply(split(df, df$Id), func)
# .id system length Id initial final
#1 FALSE 7 1 4118 10238 10242
#2 TRUE 9,10,11 3 4118 10260 10317
#3 TRUE 9,10,11 3 16001 6070 6113
#4 TRUE 10,11 2 16971 6150 6163
