Add columns to a list of data.frames - r

This is similar to an earlier question of mine, "Add columns to a dataframe based on values from a list".
Now I would like to use the solution provided there, but apply it to a list of data.frames instead of a single data.frame.
Briefly, I have a list of data.frames that looks like this:
df <- list(data.frame(A=c("a","b","c"),
B=c("1","2","1"),
C=c(0.1,0.7,0.4)),
data.frame(A=c("d","e","f"),
B=c("2","2","3"),
C=c(0.5,0.1,0.5)),
data.frame(A=c("g","h","i"),
B=c("3","1","2"),
C=c(0.2,0.1,0.5)))
I also have a list whose element names match the values of df$B, i.e., its elements hold permuted values for each value of df$B. Here is an example:
ll <- list('1'=c(0.1,0.1,0.4,0.2,0.1,0.4),
'2'=c(0.1,0.1,0.5,0.7,0.5,0.7),
'3'=c(0.1,0.1,0.2,0.2,0.2,0.5))
I want to create a new list of data.frames in which each data.frame of df gains new columns. For each row, the new columns should hold values sampled from the element of ll whose name matches that row's value of B.
Here is the desired output, for a better explanation:
> list.df
[[1]]
A B C P1 P2 P3 P4 P5 P6
1 a 1 0.1 0.1 0.1 0.4 0.2 0.1 0.4
2 b 2 0.7 0.1 0.5 0.7 0.1 0.5 0.1
3 c 1 0.4 0.4 0.1 0.2 0.1 0.1 0.4
[[2]]
A B C P1 P2 P3 P4 P5 P6
1 d 2 0.5 0.1 0.7 0.5 0.1 0.7 0.1
2 e 2 0.1 0.7 0.5 0.1 0.7 0.1 0.5
3 f 3 0.5 0.5 0.5 0.2 0.1 0.2 0.1
[[3]]
A B C P1 P2 P3 P4 P5 P6
1 g 3 0.2 0.1 0.5 0.2 0.2 0.2 0.5
2 h 1 0.1 0.2 0.1 0.4 0.2 0.2 0.4
3 i 2 0.5 0.1 0.5 0.1 0.1 0.5 0.7
The solution that I have for one single data.frame is this:
sampfun <- function(i, l) sample(l[[as.character(i)]], 10000, replace=TRUE)
list.df <- cbind(df, t(sapply(df$B, sampfun, l = ll)))
The problem is that I don't know how to implement this solution for use with a list of data.frames.
Many thanks for the help
Note: my real list of data.frames has 9,000 elements, and I want to add more than 10,000 columns, so memory use and speed are important.
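One way to extend the single-data.frame solution is to wrap it in a function and apply that function over the list with lapply. The following is only a minimal sketch, not a benchmarked answer. The names add_samples and n_draws are introduced here for illustration, and n_draws is set to 6 to match the desired output (the real case would use 10,000).

```r
df <- list(
  data.frame(A = c("a", "b", "c"), B = c("1", "2", "1"), C = c(0.1, 0.7, 0.4)),
  data.frame(A = c("d", "e", "f"), B = c("2", "2", "3"), C = c(0.5, 0.1, 0.5)),
  data.frame(A = c("g", "h", "i"), B = c("3", "1", "2"), C = c(0.2, 0.1, 0.5))
)
ll <- list("1" = c(0.1, 0.1, 0.4, 0.2, 0.1, 0.4),
           "2" = c(0.1, 0.1, 0.5, 0.7, 0.5, 0.7),
           "3" = c(0.1, 0.1, 0.2, 0.2, 0.2, 0.5))

# Same sampler as in the question, with the number of draws as a parameter
sampfun <- function(i, l, n) sample(l[[as.character(i)]], n, replace = TRUE)

# Hypothetical helper: sample n values per row of d and bind them as P1..Pn
add_samples <- function(d, l, n) {
  p <- vapply(d$B, sampfun, numeric(n), l = l, n = n)
  p <- t(matrix(p, nrow = n))                      # one row per row of d
  dimnames(p) <- list(NULL, paste0("P", seq_len(n)))
  cbind(d, p)
}

set.seed(1)      # only to make the sketch reproducible
n_draws <- 6
list.df <- lapply(df, add_samples, l = ll, n = n_draws)
list.df[[1]]
```

With 9,000 data.frames and 10,000 columns each, it may be lighter on memory to keep the sampled values as a plain numeric matrix per element rather than binding them into each data.frame, but that depends on what is done with them afterwards.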

Related

How split a heatmap at the specific rows? Error invalid for atomic vectors

I am trying to build a heatmap in R. I wanted to split the heatmap at specific rows. For example my matrix is as:
ID A B C
FD_1 0.3 0.2 1
FD_2 0.4 1 0.9
FD_3 0.6 0.8 0.2
FS_1 0.3 0.2 1
FS_2 0.4 1 0.9
FS_3 0.6 0.8 0.2
FS_4 0.4 1 0.9
FS_5 0.6 0.8 0.2
FE_1 0.3 0.2 1
FE_2 0.4 1 0.9
FE_3 0.6 0.8 0.2
FE_4 0.4 1 0.9
I need to make a heatmap that includes 3 slices: one for the 3 FD rows, one for the 5 FS rows and one for the 4 FE rows. Each slice should be labelled with its name: FD, FS and FE.
I'm using this code:
Heatmap(M_matrix, name = "level", row_split = M_matrix$ID)
But I'm getting this error:
Error in M_matrix$ID : $ operator is invalid for atomic vectors
Any suggestion?
Thanks
You can define the splits based on your ID column:
library(ComplexHeatmap)
ID=c(paste0("FD_", 1:3), paste0("FS_", 1:5), paste0("FE_", 1:4))
df <- data.frame(ID=ID,
matrix(rnorm(3*12, mean = 3), ncol=3,
dimnames=list(ID, LETTERS[1:3])),
stringsAsFactors = FALSE)
splits <- factor(gsub("_.*", "", ID))
Heatmap(matrix=as.matrix(df[,-1] ), row_split = splits, cluster_row_slices = FALSE)
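The error in the question arises because `$` is defined for lists and data frames, not for a matrix. A minimal sketch of one way around it, assuming the IDs are stored as the row names of M_matrix (the question does not show how the matrix was built), is to derive the split factor from rownames(). The Heatmap call is guarded so the rest of the snippet runs even without ComplexHeatmap installed.

```r
ID <- c(paste0("FD_", 1:3), paste0("FS_", 1:5), paste0("FE_", 1:4))

# Rebuild the example matrix from the question; IDs become row names (assumption)
pat <- rbind(c(0.3, 0.2, 1.0), c(0.4, 1.0, 0.9), c(0.6, 0.8, 0.2))
M_matrix <- pat[c(1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 3, 2), ]
dimnames(M_matrix) <- list(ID, c("A", "B", "C"))

# M_matrix$ID fails here; take the prefix before "_" from the row names instead
splits <- factor(sub("_.*", "", rownames(M_matrix)), levels = c("FD", "FS", "FE"))
table(splits)

if (requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  ht <- ComplexHeatmap::Heatmap(M_matrix, name = "level",
                                row_split = splits, cluster_row_slices = FALSE)
  # printing ht draws the heatmap
}
```

Setting the factor levels explicitly keeps the slices in the order FD, FS, FE rather than alphabetical order.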
If you want a list of data frames based on ID, you can use split.
list_df <- split(df, sub("_.*", "", df$ID))
list_df
#$FD
# ID A B C
#1 FD_1 0.3 0.2 1.0
#2 FD_2 0.4 1.0 0.9
#3 FD_3 0.6 0.8 0.2
#$FE
# ID A B C
#9 FE_1 0.3 0.2 1.0
#10 FE_2 0.4 1.0 0.9
#11 FE_3 0.6 0.8 0.2
#12 FE_4 0.4 1.0 0.9
#$FS
# ID A B C
#4 FS_1 0.3 0.2 1.0
#5 FS_2 0.4 1.0 0.9
#6 FS_3 0.6 0.8 0.2
#7 FS_4 0.4 1.0 0.9
#8 FS_5 0.6 0.8 0.2
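Alternatively, we can use group_split from dplyr:
library(dplyr)
library(stringr)
# in current dplyr the argument is `.keep` (older releases used `keep`)
list_df <- df %>%
group_split(grp = str_remove(ID, "_.*"), .keep = FALSE)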

R: Apply function on data frame A dependent on values of data frame B

I have two data frames A and B.
A = data.frame(x = c(3,-4,2), y=c(-4,7,1), z=c(-5,-1,6))
B = data.frame(x = c(0.5,0.9,0.3), y=c(0.7,0.2,0.1), z=c(0.9,0.8,0.6))
If a value in A is negative, the corresponding value in B (at the same position as in A) should be subtracted from 1. If the value in A is positive, the corresponding value in B should not change.
In the end B should look like this
x y z
1 0.5 0.3 0.1
2 0.1 0.2 0.2
3 0.3 0.1 0.6
Does anyone have an idea how this problem can be solved?
Thanks in advance,
Christian
This seems to work: B[A<0] <- 1 - B[A<0]
x y z
1 0.5 0.3 0.1
2 0.1 0.2 0.2
3 0.3 0.1 0.6
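To spell out why this works: A < 0 yields a logical matrix with the same shape as A, and a data frame can be indexed (and assigned to) with such a matrix. A self-contained sketch using the data from the question, where neg is just an illustrative name:

```r
A <- data.frame(x = c(3, -4, 2), y = c(-4, 7, 1), z = c(-5, -1, 6))
B <- data.frame(x = c(0.5, 0.9, 0.3), y = c(0.7, 0.2, 0.1), z = c(0.9, 0.8, 0.6))

neg <- A < 0           # logical matrix: TRUE where A is negative
B[neg] <- 1 - B[neg]   # only those positions of B are replaced
B
```

This assumes A and B have the same dimensions and column order; positions where A is zero or positive are left untouched.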

R 3d plot of distance matrix with colored points

I've plotted a distance matrix in R using scatterplot3d, and would now like to assign a unique color to every single point. For instance, in the following example, the plot would contain five points (A-E):
A B C D E
A 0 0.1 0.2 0.1 0.2
B 0.1 0 0.1 0.2 0.1
C 0.2 0.1 0 0.1 0.2
D 0.1 0.2 0.1 0 0.1
E 0.2 0.1 0.2 0.1 0
At present, my scatterplot3d code for the appearance of the points is very simple:
s3d <- scatterplot3d(x,y,z, main="Just A Test", pch = 19)
How do I go about making each of the points appear a different color (using hex codes)?
Have you looked at the color argument in ?scatterplot3d .... ?
dd <- read.table(header=TRUE,text="
A B C D E
A 0 0.1 0.2 0.1 0.2
B 0.1 0 0.1 0.2 0.1
C 0.2 0.1 0 0.1 0.2
D 0.1 0.2 0.1 0 0.1
E 0.2 0.1 0.2 0.1 0")
Here I'm assuming you want to use columns A-C as your coordinates ...
library(scatterplot3d)
## some made-up colors
cols <- c("#000000","#fa0ace","#eeabce","#5a0af0","#883856")
s3d <- with(dd,scatterplot3d(A,B,C,
main="Just A Test", pch = 19, color=cols,cex.symbols=3))
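If the exact colors do not matter, the hex codes can also be generated rather than typed by hand. A small sketch, assuming one color per row of dd is wanted; rainbow() from grDevices is just one possible palette, and the plotting call is guarded in case scatterplot3d is not installed.

```r
dd <- read.table(header = TRUE, text = "
A B C D E
A 0 0.1 0.2 0.1 0.2
B 0.1 0 0.1 0.2 0.1
C 0.2 0.1 0 0.1 0.2
D 0.1 0.2 0.1 0 0.1
E 0.2 0.1 0.2 0.1 0")

# one distinct hex color per point
cols <- rainbow(nrow(dd))
cols

if (requireNamespace("scatterplot3d", quietly = TRUE)) {
  s3d <- scatterplot3d::scatterplot3d(dd$A, dd$B, dd$C,
                                      main = "Just A Test", pch = 19,
                                      color = cols, cex.symbols = 3)
}
```

The color argument is matched to the points in order, so cols[1] belongs to row A, cols[2] to row B, and so on.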

Use index of a list of data.frames to apply a function in certain elements of a data frame

I have a data.frame that looks like this:
>df
A B C P1 P2 P3 P4 P5 P6
1 a 1 0.1 0.1 0.1 0.4 0.2 0.1 0.4
2 b 1 0.2 0.1 0.4 0.2 0.1 0.2 0.2
3 c 1 0.4 0.4 0.1 0.2 0.1 0.1 0.4
4 d 2 0.1 0.1 0.7 0.5 0.1 0.7 0.1
5 e 2 0.5 0.7 0.5 0.1 0.7 0.1 0.5
6 f 2 0.7 0.5 0.5 0.7 0.1 0.7 0.1
7 g 3 0.1 0.1 0.1 0.2 0.2 0.2 0.5
8 h 3 0.2 0.2 0.1 0.5 0.2 0.2 0.5
9 i 3 0.5 0.1 0.2 0.1 0.1 0.5 0.2
And a list of data.frames similar to this one:
list.1 <- list(data.frame(AA=c("a","b","c","d")),
data.frame(BB=c("e","f")),
data.frame(CC=c("a","b","i")),
data.frame(DD=c("d","e","f","g")))
Besides, I have this function:
Fisher.test <- function(p) {
Xsq <- -2*sum(log(p), na.rm=T)
p.val <- 1-pchisq(Xsq, df = 2*length(p))
return(p.val)
}
I would like to select the rows of df whose df$A values appear in each data.frame of the list, and compute Fisher.test for P1...P6 on those rows. The way I have been doing it is to merge df with each element of list.1 and then apply Fisher.test to each data.frame in the resulting list:
func <- function(x,y){merge(x,y, by.x=names(x)[1], by.y=names(y)[1])}
ll <- lapply(list.1, func, df)
ll.fis <- lapply(ll, FUN=function(i){apply(i[,4:9],2,Fisher.test)})
This works, but my real data is huge. I think a different approach could use the elements of list.1[[1]] as an index into df, compute Fisher.test on the matching rows and store the result, then do the same for list.1[[2]], and so on. This would avoid the merge, because all calculations are made directly on df, and it should also reduce RAM usage. However, I have no clue how to achieve this. Perhaps a for loop?
Thanks
Leveraging data.table is helpful here: you can easily subset your data using the .( ) syntax, and it is extremely fast on large data compared to working with, say, subset.
library(data.table)
# convert to data.table, setting the key to the column `A`
DT <- data.table(df, key="A")
p.col.names <- paste0("P", 1:6)
results <- lapply(list.1, function(ll)
DT[.(ll)][, lapply(.SD, Fisher.test), .SDcols=p.col.names] )
results
Side note:
You might want to fix the names of list.1 so that the results from lapply are properly named.
# fix the names, helpful for the lapply
names(list.1) <- lapply(list.1, names)
results:
$AA
P1 P2 P3 P4 P5 P6
1: 0.04770305 0.1624142 0.2899578 0.029753 0.1070376 0.17549
$BB
P1 P2 P3 P4 P5 P6
1: 0.7174377 0.5965736 0.2561482 0.2561482 0.2561482 0.1997866
$CC
P1 P2 P3 P4 P5 P6
1: 0.0317663 0.139877 0.139877 0.05305057 0.1620897 0.2189595
$DD
P1 P2 P3 P4 P5 P6
1: 0.184746 0.4246214 0.2704228 0.1070376 0.3215871 0.1519672
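For comparison, the index-based idea from the question can also be sketched in base R, without data.table and without merging. This is only an illustration under the assumption that df$A has no duplicates; p.cols, P and idx are names introduced here, and list.1 is created with names directly.

```r
df <- data.frame(
  A  = letters[1:9],
  B  = rep(1:3, each = 3),
  C  = c(0.1, 0.2, 0.4, 0.1, 0.5, 0.7, 0.1, 0.2, 0.5),
  P1 = c(0.1, 0.1, 0.4, 0.1, 0.7, 0.5, 0.1, 0.2, 0.1),
  P2 = c(0.1, 0.4, 0.1, 0.7, 0.5, 0.5, 0.1, 0.1, 0.2),
  P3 = c(0.4, 0.2, 0.2, 0.5, 0.1, 0.7, 0.2, 0.5, 0.1),
  P4 = c(0.2, 0.1, 0.1, 0.1, 0.7, 0.1, 0.2, 0.2, 0.1),
  P5 = c(0.1, 0.2, 0.1, 0.7, 0.1, 0.7, 0.2, 0.2, 0.5),
  P6 = c(0.4, 0.2, 0.4, 0.1, 0.5, 0.1, 0.5, 0.5, 0.2)
)
list.1 <- list(AA = data.frame(AA = c("a", "b", "c", "d")),
               BB = data.frame(BB = c("e", "f")),
               CC = data.frame(CC = c("a", "b", "i")),
               DD = data.frame(DD = c("d", "e", "f", "g")))

Fisher.test <- function(p) {
  Xsq <- -2 * sum(log(p), na.rm = TRUE)
  1 - pchisq(Xsq, df = 2 * length(p))
}

p.cols <- paste0("P", 1:6)
P <- as.matrix(df[p.cols])             # numeric matrix of the P columns

results <- lapply(list.1, function(x) {
  idx <- match(x[[1]], df$A)           # row positions in df, no merge needed
  idx <- idx[!is.na(idx)]              # drop labels that are not in df$A
  apply(P[idx, , drop = FALSE], 2, Fisher.test)
})
results$AA
```

Each element of results is a named numeric vector with one p-value per P column, so only row indices are stored per list element rather than a merged copy of df.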

connect two matrixes by columns and extract sub matrix

I have two matrices (e.g., A and B). I would like to extract columns of B based on the order of A's first column:
For example
matrix A
name score
a 0.1
b 0.2
c 0.1
d 0.6
matrix B
a d b c g h
0.1 0.2 0.3 0.4 0.6 0.2
0.2 0.1 0.4 0.7 0.1 0.1
...
I want matrix B to look like this at the end
matrix B_modified
a b c d
0.1 0.3 0.4 0.2
0.2 0.4 0.7 0.1
Can this be done in either Perl or R? Thanks a lot in advance.
I've no idea what problems you're facing. Here's how I've done it.
## get data as matrix
a <- read.table(header=TRUE, text="name score
a 0.1
b 0.2
c 0.1
d 0.6", stringsAsFactors=FALSE) # load directly as characters
b <- read.table(header=TRUE, text="a d b c g h
0.1 0.2 0.3 0.4 0.6 0.2
0.2 0.1 0.4 0.7 0.1 0.1", stringsAsFactors=FALSE)
a <- as.matrix(a)
b <- as.matrix(b)
Now subset to get your final result:
b[, a[, "name"]]
# a b c d
# [1,] 0.1 0.3 0.4 0.2
# [2,] 0.2 0.4 0.7 0.1
The error:
[.data.frame(b, , a[, "name"]) : undefined columns selected
means that you are trying to select a column that is listed in a$name but does not exist in b. One solution is to use intersect with colnames(b). This also converts the factor to a string, and you get the right order.
b[, intersect(a[, "name"],colnames(b))] ## the order is important here
For example, I tested this with the following data:
b <- read.table(text='
a d b c
0.1 0.2 0.3 0.4
0.2 0.1 0.4 0.7',header=TRUE)
a <- read.table(text='name score
a 0.1
z 0.5
c 0.1
d 0.6',header=TRUE)
b[, intersect(a[, "name"],colnames(b))]
a c d
1 0.1 0.4 0.2
2 0.2 0.7 0.1
If your data originates as an R data structure then it would be perverse to export it and solve this problem using Perl. However, if you have text files that look like the data you have shown, then here is a Perl solution for you.
I have split the output on spaces. That can be changed very simply if necessary.
use strict;
use warnings;
use autodie;
sub read_file {
my ($name) = @_;
open my $fh, '<', $name;
my @data = map [ split ], <$fh>;
\@data;
}
my $matrix_a = read_file('MatrixA.txt');
my @fields = map $matrix_a->[$_][0], 1 .. $#$matrix_a;
my $matrix_b = read_file('MatrixB.txt');
my @headers = @{$matrix_b->[0]};
my @indices = map {
my $label = $_;
grep $headers[$_] eq $label, 0..$#headers
} @fields;
for my $row (0 .. $#$matrix_b) {
print join(' ', map $matrix_b->[$row][$_], @indices), "\n";
}
output
a b c d
0.1 0.3 0.4 0.2
0.2 0.4 0.7 0.1
