I have a list of data.tables that I need to cbind; however, I only need the last X columns of each.
My data is structured as follows:
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)
> DT.list
[[1]]
x y v1 v2
1: 1 a 1 3
2: 1 a 2 4
[[2]]
x y v3
1: 1 a 5
2: 1 a 6
[[3]]
x y v4 v5 v6
1: 1 a 7 9 11
2: 1 a 8 10 12
Columns x and y are the same for each of the data.tables, but the number of columns differs. The output should not include duplicate x and y columns. It should look as follows:
x y v1 v2 v3 v4 v5 v6
1: 1 a 1 3 5 7 9 11
2: 1 a 2 4 6 8 10 12
I want to avoid using a loop. I am able to bind the data.tables using do.call("cbind", DT.list) and then remove the duplicates manually, but is there a way where the duplicates aren't created in the first place? Also, efficiency is important since the lists can be long with large data.tables.
Thanks!
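For reference, here is a minimal sketch of the "bind, then drop duplicates" baseline mentioned above (the answers below avoid creating the duplicates in the first place):
res <- do.call(cbind, DT.list)
# keep only the first occurrence of each column name (drops the extra x and y)
res <- res[, !duplicated(names(res)), with = FALSE]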
Here's another way:
Reduce(
  function(x, y) {
    newcols = setdiff(names(y), names(x))
    x[, (newcols)] <- y[, ..newcols]
    x
  },
  DT.list,
  init = copy(DT.list[[1]][, c("x","y")])
)
# x y v1 v2 v3 v4 v5 v6
# 1: 1 a 1 3 5 7 9 11
# 2: 1 a 2 4 6 8 10 12
This avoids modifying the list (as @bgoldst's <- NULL assignment does) or making copies of every element of the list (as, I think, the lapply approach does). I would probably do the <- NULL thing in most practical applications, though.
Here's how it could be done in one shot, using lapply() to remove columns x and y from second-and-subsequent data.tables before calling cbind():
do.call(cbind, c(DT.list[1], lapply(DT.list[2:length(DT.list)], `[`, j = -c(1,2))));
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
Another approach is to remove columns x and y from second-and-subsequent data.tables before doing a straight cbind(). I think there's nothing wrong with using a for loop for this:
for (i in seq_along(DT.list)[-1]) DT.list[[i]][,c('x','y')] <- NULL;
DT.list;
## [[1]]
## x y v1 v2
## 1: 1 a 1 3
## 2: 1 a 2 4
##
## [[2]]
## v3
## 1: 5
## 2: 6
##
## [[3]]
## v4 v5 v6
## 1: 7 9 11
## 2: 8 10 12
##
do.call(cbind,DT.list);
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
Another option would be to use the [ indexing function inside lapply on the list of data.tables and exclude the "unwanted" columns (in your case, x and y). In this way, duplicate columns are not created in the first place.
# your given test data
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)
A) using a character vector to indicate which columns to exclude
# cbind a list of subsetted data.tables
exclude.col <- c("x","y")
myDT <- do.call(cbind, lapply(DT.list, `[`,,!exclude.col, with = FALSE))
myDT
## v1 v2 v3 v4 v5 v6
## 1: 1 3 5 7 9 11
## 2: 2 4 6 8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
B) same as above but using the character vector directly in lapply
myDT <- do.call(cbind, lapply(DT.list, `[`,,!c("x","y")))
myDT
## v1 v2 v3 v4 v5 v6
## 1: 1 3 5 7 9 11
## 2: 2 4 6 8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
C) same as above, but all in one line
do.call( cbind, c(list(DT.list[[1]][,.(x,y)]), lapply(DT.list, `[`,,!c("x","y"))) )
# way too many brackets...but I think it works
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
I have a data.frame like the following:
V1 V2 V3 V4 V5
1 a a b a a
2 a a a
3 b b b b
4 a c d
I want to keep the rows where all non-empty cells contain the same character (in my example, rows 2 and 3). Is there any function that can help me achieve this?
Here is a base R option using apply
df[apply(df, 1, function(x) length(unique(x[x != ''])) == 1), ]
#V1 V2 V3 V4 V5
#2 a a a
#3 b b b b
Explanation: length(unique(x[x != ''])) == 1 checks whether the non-empty elements of a vector x contain only a single unique element. apply with MARGIN = 1 means that we loop over the rows of the data.frame.
Sample data
df <- read.table(text = " V1 V2 V3 V4 V5
1 a a b a a
2 a a a '' ''
3 b b b b ''
4 a c d '' ''", header = T)
My dataset has 575 rows and 368 columns and it looks like this:
NUTS3_2016 URAU_CODE FUA_CODE X2018.01.01.x X2018.01.02.x X2018.01.03.x ...
1 AT130 AT001C1 AT001L3 0.46369280 0.3582241 0.2777274 ...
2 AT211 AT006C1 AT006L2 -0.04453125 -0.3092773 -0.3284180 ...
3 AT312 AT003C1 AT003L3 1.02993164 0.9640137 0.6413086 ...
4 AT323 AT004C1 AT004L3 1.21105239 1.4335363 1.2400620 ...
... ... .... ... ... ... .... ...
I want to calculate, for each row, the probability (i.e. the proportion of days) that x > 2.5.
I also want to calculate, for each row, the longest run of consecutive days on which x remains > 2.5.
What are your suggestions?
Many thanks
Attempt:
A <- c("a", "b", "c", "d", "e")
B <- c(1:5)
C <- c(1:5)
x <- data.frame(A,B,C)
x$prob <- rowMeans(x[-1] > 2)
x
# A B C prob
# 1 a 1 1 0
# 2 b 2 2 0
# 3 c 3 3 1
# 4 d 4 4 1
# 5 e 5 5 1
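Translated to the question's layout (an editor's sketch, assuming the day columns start at column 4, after the three ID columns, and that the data frame is called df):
# proportion of days with a value above 2.5, per row
df$prob <- rowMeans(df[-(1:3)] > 2.5, na.rm = TRUE)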
We can use rle for finding the length of the maximum streak.
## Some sample data:
set.seed(47)
data = matrix(rnorm(24, mean = 2.5), nrow = 3)
data = cbind(ID = c("A", "B", "C"), as.data.frame(data))
data
# ID V1 V2 V3 V4 V5 V6 V7 V8
# 1 A 4.494696 2.218235 1.514518 1.034250 2.9938202 3.170779 1.7966118 2.749148
# 2 B 3.211143 2.608776 2.515131 1.577544 0.6717708 2.418922 2.4594218 2.159584
# 3 C 2.685405 1.414263 2.247954 2.539602 2.5914729 3.764241 0.9338379 2.917191
data$max_streak = apply(data[-1], 1, function(x) with(rle(x > 2.5), max(lengths[values])))
# ID V1 V2 V3 V4 V5 V6 V7 V8 max_streak
# 1 A 4.494696 2.218235 1.514518 1.034250 2.9938202 3.170779 1.7966118 2.749148 2
# 2 B 3.211143 2.608776 2.515131 1.577544 0.6717708 2.418922 2.4594218 2.159584 3
# 3 C 2.685405 1.414263 2.247954 2.539602 2.5914729 3.764241 0.9338379 2.917191 3
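One caveat (an editor's note, not part of the original answer): if a row never exceeds 2.5, lengths[values] is empty and max() returns -Inf with a warning. A small guard, assuming no NAs and that such rows should get 0:
max_streak <- function(x, thr = 2.5) {
  r <- rle(x > thr)  # runs of consecutive TRUE/FALSE
  if (any(r$values)) max(r$lengths[r$values]) else 0L
}
data$max_streak <- apply(data[-1], 1, max_streak)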
I'm asking this question because even though there are many similar questions on this website (like this, this, and this), none of them exactly matches my situation. Actually, this link asks the same question as mine, but the answer there is unclear to me and raises the question that I am about to ask.
I have a dataset from which I am constructing a stacked barplot, and I want to know how I can arrange the stacked barplot so that "similar" individuals cluster together. I work in bioinformatics; here is the dataset, which is a d-by-n matrix. In this toy dataset, there are d = 10 ancestral populations and n = 5 individuals:
> a
V1 V2 V3 V4 V5
1 0.534410243 0.009358740 0.011295181 0.2141751740 0.0030129254
2 0.026653603 0.372426720 0.447847534 0.0179177507 0.4072904477
3 0.193317915 0.003605024 0.003186611 0.4832114736 0.0007095471
4 0.111881585 0.000000000 0.000000000 0.2296213741 0.0119233461
5 0.089696570 0.591163629 0.509774416 0.0032542030 0.5535847030
6 0.007543558 0.000000000 0.000000000 0.0364907757 0.0013148362
7 0.004862942 0.000000000 0.002123909 0.0146682272 0.0004053690
8 0.009276195 0.011710457 0.014367894 0.0000000000 0.0000000000
9 0.006903171 0.004314528 0.011404455 0.0000000000 0.0126889937
10 0.015454219 0.007420903 0.000000000 0.0006610215 0.0090698319
All columns add up to 1. I create a stacked barplot like so:
library(dplyr)
library(tidyr)
library(ggplot2)
pop <- rownames(a)
a <- a %>% mutate(pop = rownames(a))
a_long <- gather(a, key, value, -pop)
# trying to create a similarity index
a_long <- a_long %>% group_by(key) %>%
mutate(mean = mean(value)) %>%
arrange(desc(mean))
# looking at some of the expanded dataset
> a_long[1:20,]
# A tibble: 20 x 4
# Groups: key [2]
pop key value mean
<chr> <chr> <dbl> <dbl>
1 1 V2 0.00936 0.1
2 2 V2 0.372 0.1
3 3 V2 0.00361 0.1
4 4 V2 0 0.1
5 5 V2 0.591 0.1
6 6 V2 0 0.1
7 7 V2 0 0.1
8 8 V2 0.0117 0.1
9 9 V2 0.00431 0.1
10 10 V2 0.00742 0.1
11 1 V4 0.214 0.1
12 2 V4 0.0179 0.1
13 3 V4 0.483 0.1
14 4 V4 0.230 0.1
15 5 V4 0.00325 0.1
16 6 V4 0.0365 0.1
17 7 V4 0.0147 0.1
18 8 V4 0 0.1
19 9 V4 0 0.1
20 10 V4 0.000661 0.1
# colors
v_colors <- c("#440154FF", "#443B84FF", "#34618DFF", "#404588FF", "#1FA088FF", "#40BC72FF",
"#67CC5CFF", "#A9DB33FF", "#DDE318FF", "#FDE725FF")
plot <- ggplot(a_long, aes(x = key, y = value, fill = pop))
plot + geom_bar(position="stack", stat="identity") + scale_fill_manual(values = v_colors)
The output looks like this (image not reproduced here): a stacked barplot with one bar per individual, V1 through V5, each bar stacked by population.
How can I make the output neater, e.g. with the individuals that have a higher proportion of population 5 ancestry placed next to each other on the x-axis? So far I have tried computing the mean of value for each individual, but that didn't work, since it's not a good similarity measure. How can I create a similarity index that tells me how similar individual 1 is to individual 2, and how do I then order the individuals on the x-axis so that they look well clustered (e.g. like the barplots in this figure)?
Thanks!
One last thing: if you want to re-create the dataset a, here is the code:
v1 = c(0.534410243, 0.026653603, 0.193317915, 0.111881585, 0.089696570, 0.007543558, 0.004862942, 0.009276195, 0.006903171, 0.015454219)
v2 = c(0.009358740, 0.372426720, 0.003605024, 0.000000000, 0.591163629, 0.000000000, 0.000000000, 0.011710457, 0.004314528, 0.007420903)
v3 = c(0.011295181, 0.447847534, 0.003186611, 0.000000000, 0.509774416, 0.000000000, 0.002123909, 0.014367894, 0.011404455, 0.000000000)
v4 = c(0.2141751740, 0.0179177507, 0.4832114736, 0.2296213741, 0.0032542030, 0.0364907757, 0.0146682272, 0.0000000000, 0.0000000000, 0.0006610215)
v5 = c(0.0030129254, 0.4072904477, 0.0007095471, 0.0119233461, 0.5535847030, 0.0013148362, 0.0004053690, 0.0000000000, 0.0126889937, 0.0090698319)
a = data.frame(V1 = v1, V2 = v2, V3 = v3, V4 = v4, V5 = v5)
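As for the ordering itself, one possible approach (an editor's sketch, not from the original post) is to hierarchically cluster the individuals, i.e. the columns of a, and use the dendrogram order as the x-axis order. This assumes a is the raw d-by-n data frame created just above, before the pop column is added:
# cluster individuals on the Euclidean distance between their ancestry profiles
ord <- hclust(dist(t(a)))$order
# make the x-axis of the existing plot follow the cluster order
a_long$key <- factor(a_long$key, levels = colnames(a)[ord])
Re-running the ggplot call above then places similar individuals next to each other.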
I have the following dataframe
df <- data.frame(V1 = c(1, 2), V2 = c(10, 20), V3=c("9,1", "13,3,4"))
> df
V1 V2 V3
1 1 10 9,1
2 2 20 13,3,4
Now I want to create a new column V4 that takes the value after the first ',' in V3, divides it by the value in V2, and multiplies the result by 100.
In my example this would be:
(1 divided by 10) * 100 = 10
(3 divided by 20) * 100 = 15
So the output would look like this
df_new
V1 V2 V3 V4
1 1 10 9,1 10
2 2 20 13,3,4 15
How can this be achieved?
We can use a regex to extract the number after the first comma, then divide it by V2 and multiply by 100.
transform(df, V4 = as.integer(sub("\\d+,(\\d+).*", "\\1", V3))/V2 * 100)
# V1 V2 V3 V4
#1 1 10 9,1 10
#2 2 20 13,3,4 15
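An equivalent without a regex (an editor's sketch, assuming every V3 value has at least two comma-separated fields):
# split V3 on commas and take the second field
second <- sapply(strsplit(as.character(df$V3), ","), `[`, 2)
df$V4 <- as.numeric(second) / df$V2 * 100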
I have a data frame looking like this:
as.data.frame(matrix(c(1,2,3,NA,4,5,NA,NA,9), nrow = 3, ncol = 3))
V1 V2 V3
1 1 NA NA
2 2 4 NA
3 3 5 9
I would like to calculate a cumulative mean per column, ignoring NAs, so something like this:
   V1  V2 V3
1 1.0  NA NA
2 1.5 4.0 NA
3 2.0 4.5  9
I tried this:
B[!is.na(A)] <- as.data.frame(apply(B[!is.na(A)], 2, cummean))
But received this error message:
dim(X) must have a positive length
Thanks for your help!
Cheers
This should work:
A <- as.data.frame(matrix(c(1,2,3,NA,4,5,NA,NA,9), nrow = 3, ncol = 3))
B <- as.data.frame(apply(A, 2, function(col) {
  col[!is.na(col)] <- dplyr::cummean(col[!is.na(col)])
  return(col)
}))
> B
V1 V2 V3
1 1.0 NA NA
2 1.5 4.0 NA
3 2.0 4.5 9
We can use data.table
library(data.table)
library(dplyr)
d1 <- as.data.frame(matrix(c(1,2,3,NA,4,5,NA,NA,9), nrow = 3, ncol = 3))
setDT(d1)
for (j in seq_along(d1)) {
  set(d1, i = which(!is.na(d1[[j]])), j = j,
      value = cummean(d1[[j]][!is.na(d1[[j]])]))
}
d1
# V1 V2 V3
#1: 1.0 NA NA
#2: 1.5 4.0 NA
#3: 2.0 4.5 9
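A base-R alternative (an editor's sketch, using A as defined above; no packages needed): take the cumulative sum of the non-NA values and divide by their running count.
B <- as.data.frame(lapply(A, function(col) {
  ok <- !is.na(col)
  col[ok] <- cumsum(col[ok]) / seq_len(sum(ok))  # running mean of the non-NA values
  col
}))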