Calculating the gene overlap between different gene lists as a % - r

I have generated a table to show the overlap between different gene lists. Since I have eight different gene lists I have 64 results. The code I currently have is as follows:
#-------------------------------------------------------------------------------
# Set the working directory and load the data files
#-------------------------------------------------------------------------------
library(magrittr)  # provides the %>% pipe used below
setwd("~/Desktop/R_Project/Gene_overlap")
getwd()
files <- list.files(pattern = "\\.txt$", full.names = TRUE)  # pattern is a regex, not a glob
files
data.list <- lapply(files, function(fil) {
scan(file=fil, what=character())
})
names(data.list) <- basename(files) %>% stringr::str_remove("\\.txt$")
str(data.list)
# List of 8
# $ GSE108363_BCGdown_D: chr [1:350] "IL1B" "IL6" "IL1A" "CCL20" ...
# $ GSE108363_BCGdown_V: chr [1:267] "IL6" "CCL20" "IL1A" "CXCL5" ...
# $ GSE108363_BCGup_D : chr [1:250] "FABP4" "CMTM2" "FUCA1" "CD36" ...
# $ GSE108363_BCGup_V : chr [1:429] "FCN1" "FCGR3B" "MNDA" "CPVL" ...
# $ GSE108363_MTBdown_D: chr [1:86] "CCL20" "IL1B" "IL1A" "IL6" ...
# $ GSE108363_MTBdown_V: chr [1:244] "IL1B" "IL1A" "CCL20" "IL6" ...
# $ GSE108363_MTBup_D : chr [1:128] "FUCA1" "FGL2" "TGFBI" "CPVL" ...
# $ GSE108363_MTBup_V : chr [1:286] "FABP4" "RNASE1" "MNDA" "CPVL" ...
intersect(data.list$GSE108363_BCGdown_D, data.list$GSE108363_BCGdown_V) %>% length
sapply(data.list, length)
#-------------------------------------------------------------------------------
# Using the intersect function to see the overlaps
#-------------------------------------------------------------------------------
data.file1 <- "GSE108363_BCGdown_V.txt"
data.file2 <- "GSE108363_BCGdown_D.txt"
data.file3 <- "GSE108363_BCGup_V.txt"
data.file4 <- "GSE108363_BCGup_D.txt"
data.file5 <- "GSE108363_MTBdown_V.txt"
data.file6 <- "GSE108363_MTBdown_D.txt"
data.file7 <- "GSE108363_MTBup_V.txt"
data.file8 <- "GSE108363_MTBup_D.txt"
genevect1 <- scan(data.file1, what=character(), sep="\n")
genevect2 <- scan(data.file2, what=character(), sep="\n")
genevect3 <- scan(data.file3, what=character(), sep="\n")
genevect4 <- scan(data.file4, what=character(), sep="\n")
genevect5 <- scan(data.file5, what=character(), sep="\n")
genevect6 <- scan(data.file6, what=character(), sep="\n")
genevect7 <- scan(data.file7, what=character(), sep="\n")
genevect8 <- scan(data.file8, what=character(), sep="\n")
filelist <- list(data.file1, data.file2, data.file3, data.file4, data.file5, data.file6, data.file7, data.file8)
all(sapply(filelist, file.exists))
# read files:
gene.lists <- lapply(filelist, function(f) {
scan(file=f, what=character())
})
# set up empty matrix
x <- (length(gene.lists))^2
x
y <- rep(NA, x)
mx <- matrix(y, ncol=length(gene.lists))
mx
row.names(mx) <- sapply(filelist, basename) %>% stringr::str_remove('\\.txt$')
colnames(mx) <- sapply(filelist, basename) %>% stringr::str_remove('\\.txt$')
mx
mx.overlap.count <- mx
# seq_along(gene.lists) # 1 2 3 4 5 6 7 8
for (i in seq_along(gene.lists)) {
g1 <- gene.lists[[i]]
for (j in seq_along(gene.lists)) {
g2 <- gene.lists[[j]]
a <- intersect(g1, g2)
b <- length(a)
mx.overlap.count[j,i] <- b
}
}
mx.overlap.count
View(mx.overlap.count)
I would now like to do something similar, but instead of viewing the overlap as counts, I would like to see the extent of the overlap between the different gene lists as a percentage. I think I need to check whether g1 or g2 is longer, divide b by the length of the longer one, and multiply by 100. Any suggestions would be greatly appreciated.

Using a sample of letters, as you did not provide the gene lists:
set.seed(1)
data.list <- lapply(sample(10:20), function(n)LETTERS[sample(1:26, n)])
overlaps <- sapply(data.list, function(g1)
sapply(data.list, function(g2)
{round(length(intersect(g1, g2)) / length(g2) * 100)}))
overlaps
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] 100 50 67 75 42 92 58 92 67 33 92
[2,] 46 100 62 69 54 77 62 69 69 54 62
[3,] 53 53 100 60 33 73 60 73 80 33 60
[4,] 53 53 53 100 47 71 53 76 53 29 82
[5,] 45 64 45 73 100 91 64 82 36 45 73
[6,] 61 56 61 67 56 100 56 89 56 33 72
[7,] 50 57 64 64 50 71 100 86 71 50 64
[8,] 55 45 55 65 45 80 60 100 60 40 80
[9,] 50 56 75 56 25 62 62 75 100 38 69
[10,] 40 70 50 50 50 60 70 80 60 100 70
[11,] 58 42 47 74 42 68 47 84 58 37 100
(I used set.seed so you can reproduce the example.) It uses a nested sapply to iterate over every pair of gene lists and calculates the percentage for each combination by dividing the size of the intersection by the total length of the second gene vector. If you want to divide by the length of the longer of the two gene vectors, replace length(g2) with max(length(g1), length(g2)).
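As a minimal sketch of the max-length variant mentioned above: the helper name overlap_pct and the three toy gene lists are made up for illustration and are not the question's data.

```r
# Hypothetical toy gene lists (not the question's data)
gene.lists <- list(
  A = c("IL1B", "IL6", "IL1A", "CCL20"),
  B = c("IL6", "CCL20", "CXCL5"),
  C = c("FABP4", "IL6")
)

# Percentage overlap relative to the longer of the two lists
overlap_pct <- function(g1, g2) {
  100 * length(intersect(g1, g2)) / max(length(g1), length(g2))
}

# Nested sapply; the list names become the row and column names
mx.overlap.pct <- sapply(gene.lists, function(g1)
  sapply(gene.lists, function(g2) overlap_pct(g1, g2)))

round(mx.overlap.pct)
```

Because the denominator no longer depends on which list comes second, this matrix is symmetric, unlike the length(g2) version.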


How to use characters in variables summing in R?

I have a data frame. Here is a small example:
a <- rnorm(100, 5, 2)
b <- rnorm(100, 10, 3)
c <- rnorm(100, 15, 4)
df <- data.frame(a, b, c)
And I have a character variable vect <- "c('a','b')"
I try to calculate the sum of the variables using the command
df$d <- df[vect]
which must be an equivalent of
df$d <- df[c('a','b')]
But as a result I get an error:
Error in `[.data.frame`(df, vect) : undefined columns selected
Your assumption that
vect <- "c('a','b')"
df$d <- df[vect]
is equivalent to
df$d <- df[c('a','b')]
is incorrect.
As @Karthik points out, you should remove the outer quotation marks in the assignment to vect, so that it is a character vector rather than a single string.
However, from your question it sounds like you want to then sum the elements specified in vect and then assign to d. To do this you need to slightly change your code
vect <- c('a','b')
df$d <- apply(X = df[vect], MARGIN = 1, FUN = sum)
This computes the element-wise sum of the columns in df specified by vect. MARGIN = 1 specifies that the sum is applied row-wise rather than column-wise.
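For a plain row-wise sum, base R's rowSums() should give the same totals without the apply() call. A small sketch, using a shortened toy data frame and an arbitrary seed rather than the question's exact data:

```r
set.seed(42)  # arbitrary seed, only so the toy data is reproducible
df <- data.frame(a = rnorm(5, 5, 2), b = rnorm(5, 10, 3), c = rnorm(5, 15, 4))
vect <- c("a", "b")

# rowSums() adds the selected columns row by row in one vectorised call
df$d <- rowSums(df[vect])

head(df)
```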
EDIT:
As @ThomasIsCoding points out below, if for some reason vect has to be a string, you can parse the string into an R expression using str2lang and evaluate it:
vect <- "c('a','b')"
parsed_vect <- eval(str2lang(vect))
df$d <- apply(X = df[parsed_vect], MARGIN = 1, FUN = sum)
Perhaps you can try
> df[eval(str2lang(vect))]
a b
1 8.1588519 9.0617818
2 3.9361214 13.2752377
3 5.5370983 8.8739725
4 8.4542050 8.5704234
5 3.9044461 13.2642793
6 5.6679639 12.9529061
7 4.0183808 6.4746806
8 3.6415608 11.0308990
9 4.5237453 7.3255129
10 6.9379168 9.4594150
11 5.1557935 11.6776181
12 2.3829337 3.5170335
13 4.3556430 7.9706624
14 7.3274615 8.1852829
15 -0.5650641 2.8109197
16 7.1742283 6.8161200
17 3.3412044 11.6298940
18 2.5388981 10.1289533
19 3.8845686 14.1517643
20 2.4431608 6.8374837
21 4.8731053 12.7258259
22 6.9534912 6.5069513
23 4.4394807 14.5320225
24 2.0427553 12.1786148
25 7.1563978 11.9671603
26 2.4231207 6.1801862
27 6.5830372 0.9814878
28 2.5443326 9.8774632
29 1.1260322 9.4804636
30 4.0078436 12.9909014
31 9.3599808 12.2178596
32 3.5362245 8.6758910
33 4.6462337 8.6647953
34 2.0698037 7.2750532
35 7.0727970 8.9386798
36 4.8465248 8.0565347
37 5.6084462 7.5676308
38 6.7617479 9.5357666
39 5.2138482 13.6822924
40 3.6259103 13.8659939
41 5.8586547 6.5087016
42 4.3490281 9.5367522
43 7.5130701 8.1699117
44 3.7933813 9.3241308
45 4.9466813 9.4432584
46 -0.3730035 6.4695187
47 2.0646458 10.6511916
48 4.6027309 4.9207746
49 5.9919348 7.1946723
50 6.0148330 13.4702419
51 5.5354452 9.0193366
52 5.2621651 12.8856488
53 6.8580210 6.3526151
54 8.0812166 14.4659778
55 3.6039030 5.9857886
56 9.8548553 15.9081336
57 3.3675037 14.7207681
58 3.9935336 14.3186175
59 3.4308085 10.6024579
60 3.9609624 6.6595521
61 4.2358603 10.6600581
62 5.1791856 9.3241118
63 4.6976289 13.2833055
64 5.1868906 7.1323826
65 3.1810915 12.8402472
66 6.0258287 9.3805249
67 5.3768112 6.3805096
68 5.7072092 7.1130150
69 6.5789349 8.0092541
70 5.3175820 17.3377234
71 9.7706112 10.8648956
72 5.2332127 12.3418373
73 4.7626124 13.8816910
74 3.9395911 6.5270785
75 6.4394724 10.6344965
76 2.6803695 10.4501753
77 3.5577834 8.2323369
78 5.8431140 7.7932460
79 2.8596818 8.9581837
80 2.7365174 10.2902512
81 4.7560973 6.4555758
82 4.6519084 8.9786777
83 4.9467471 11.2818536
84 5.6167284 5.2641380
85 9.4700525 2.9904731
86 4.7392906 11.3572521
87 3.1221908 6.3881556
88 5.6949432 7.4518023
89 5.1435241 10.8912283
90 2.1628966 10.5080671
91 3.6380837 15.0594135
92 5.3434709 7.4034042
93 -0.1298439 0.4832707
94 7.8759390 2.7411723
95 2.0898649 9.7687250
96 4.2131549 9.3175228
97 5.0648105 11.3943350
98 7.7225193 11.4180456
99 3.1018895 12.8890257
100 4.4166832 10.4901303

R: how to rewrite my code to run more efficiently?

I have a matrix (named rating) with dimensions n x 140000 and another matrix (named trust) with dimensions n x n, where n varies with the group and can range from 1 to 15000. I need to multiply each column of rating by trust. For example:
trust =
a1 a2 a3 a4 a5
b1 b2 b3 b4 b5
c1 c2 c3 c4 c5
d1 d2 d3 d4 d5
e1 e2 e3 e4 e5
rating =
1 2 3 4 5 6 7 8
2 5 7 8 9 2 1 6
3 5 3 6 8 1 2 5
4 7 8 2 4 5 6 7
5 2 5 7 8 9 1 4
answer1 =
a1.1 a2.2 a3.3 a4.4 a5.5
b1.1 b2.2 b3.3 b4.4 b5.5
c1.1 c2.2 c3.3 c4.4 c5.5
d1.1 d2.2 d3.3 d4.4 d5.5
e1.1 e2.2 e3.3 e4.4 e5.5
answer2 =
a1.2 a2.5 a3.5 a4.7 a5.2
b1.2 b2.5 b3.5 b4.7 b5.2
c1.2 c2.5 c3.5 c4.7 c5.2
d1.2 d2.5 d3.5 d4.7 d5.2
e1.2 e2.5 e3.5 e4.7 e5.2
and answer3 is obtained by multiplying by the 3rd column of rating, and so on. Then I sum the rows of answer1, answer2, ..., store each result in a vector, and store each vector in a list for future use.
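LstsumRbyT <- vector("list", ncol(rating))  # pre-allocate the result list
for (k in seq_len(ncol(rating))) {
clmy <- as.matrix(rating[, k])                  # k-th column of rating
answer <- sweep(trust, MARGIN = 2, clmy, '*')   # scale column j of trust by clmy[j]
sumtrustbyrating <- rowSums(answer)
LstsumRbyT[[k]] <- sumtrustbyrating
}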
It is working perfectly if I change the ncol(rating) to a small value (about 100). But for the actual data, I have 140000 columns. It takes time and I couldn't get the final execution result. Please help me to enhance the performance of my code for a huge data set.
How about a matrix product? Or is that too slow?
rating <- matrix(c(1, 2, 3, 4, 5,2, 5, 5, 6, 3, 3, 4, 1, 2, 1), ncol=3)
trust <- matrix(rep(1:5, rep(5, 1)), 5, byrow=TRUE)
Running your code above yields
LstsumRbyT
[[1]]
[1] 55 55 55 55 55
[[2]]
[1] 66 66 66 66 66
[[3]]
[1] 27 27 27 27 27
which is the same as
trust %*% rating
[,1] [,2] [,3]
[1,] 55 66 27
[2,] 55 66 27
[3,] 55 66 27
[4,] 55 66 27
[5,] 55 66 27
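As a quick check of that equivalence, the sketch below runs the question's loop (condensed, with the result list pre-allocated) next to the matrix product and splits the product back into a list. The names prod.mat and prod.list are introduced here only for illustration.

```r
# Toy matrices from the answer above
rating <- matrix(c(1, 2, 3, 4, 5, 2, 5, 5, 6, 3, 3, 4, 1, 2, 1), ncol = 3)
trust  <- matrix(rep(1:5, 5), 5, byrow = TRUE)

# The question's loop, condensed, with the result list pre-allocated
LstsumRbyT <- vector("list", ncol(rating))
for (k in seq_len(ncol(rating))) {
  LstsumRbyT[[k]] <- rowSums(sweep(trust, MARGIN = 2, rating[, k], `*`))
}

# Matrix product, then one list element per column of the result
prod.mat  <- trust %*% rating
prod.list <- lapply(seq_len(ncol(prod.mat)), function(k) prod.mat[, k])

all.equal(LstsumRbyT, prod.list)
```

If the list form is not actually needed downstream, keeping the product as a single matrix avoids the extra lapply step.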
If this isn't enough, it could probably be improved a bit with RcppArmadillo.
To add to the benchmarking discussion: if your for loop above is wrapped in a function f(), then I get
microbenchmark(trust %*% rating, f())
Unit: microseconds
expr min lq mean median uq max neval cld
trust %*% rating 1.418 1.7010 2.97663 2.7215 3.5965 14.452 100 a
f() 593.890 700.9775 764.00515 766.5535 792.6375 1511.104 100 b
which is quite a substantial speedup with the normal matrix product.
I would vectorize everything:
library(data.table)
set.seed(666)#in order to have reproducible results
n<-10#number of cols and rows
(trust<-matrix(runif(n*n),ncol=n,nrow=n))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.77436849 0.77589308 0.98422408 0.4697785 0.2444375 0.06913359 0.7748744 0.60379428 0.7659585 0.13247078
[2,] 0.19722419 0.01637905 0.60134555 0.3976166 0.5309707 0.08462063 0.8120639 0.32826395 0.7758464 0.07851311
[3,] 0.97801384 0.09574478 0.03834435 0.8046367 0.1183959 0.12994557 0.2606025 0.66611781 0.3125150 0.37822385
[4,] 0.20132735 0.14216354 0.14149569 0.5088974 0.9833834 0.74613202 0.6515950 0.87478750 0.8422173 0.57962476
[5,] 0.36124443 0.21112624 0.80638553 0.6349154 0.8977528 0.03887918 0.9238039 0.06887527 0.3141499 0.53642512
[6,] 0.74261194 0.81125644 0.26668568 0.4942517 0.7385738 0.68563542 0.2661061 0.79346301 0.7565639 0.10853192
[7,] 0.97872844 0.03654720 0.04270205 0.2801309 0.3773107 0.14397736 0.2661330 0.57142701 0.9675244 0.74031515
[8,] 0.49811371 0.89163741 0.61217452 0.9087104 0.6061688 0.89107996 0.9109179 0.04894407 0.1694229 0.45178964
[9,] 0.01331584 0.48323641 0.55334840 0.7841162 0.5121943 0.08963612 0.5905635 0.98035135 0.6968752 0.64610821
[10,] 0.25994613 0.46666453 0.85350077 0.5589970 0.9892467 0.03773272 0.9181476 0.91453735 0.8726508 0.74929873
(rating<-matrix(sample(n*n),ncol=n,nrow=n))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 58 19 13 25 23 96 38 100 47 93
[2,] 37 22 45 41 4 18 52 83 89 39
[3,] 87 36 15 40 94 11 31 63 35 10
[4,] 59 88 81 64 68 27 92 56 49 46
[5,] 24 90 8 44 43 82 14 57 79 66
[6,] 95 74 48 70 7 33 34 42 60 50
[7,] 26 65 73 61 32 12 97 98 9 69
[8,] 21 86 1 99 6 72 75 20 71 62
[9,] 29 85 55 30 53 80 77 2 28 51
[10,] 67 91 76 16 5 3 84 54 78 17
A function:
prod1<-function(m1,m2){
res<-NULL
if(dim(m1)[1]==dim(m2)[1])
res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(1:nrow(m2)),function(y) {lapply(seq_along(1:nrow(m1)[1]),function(x){m1[,x]*m2[y,x]})})))$V1))
return(res)
}
will produce:
(answer1<-prod1(trust,rating)) # the order of the arguments DOES matter
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 44.9133724 14.7419685 12.7949130 11.744463 5.622062 6.636824 29.445226 60.379428 36.000049 12.319782
2: 11.4390031 0.3112020 7.8174921 9.940414 12.212325 8.123580 30.858427 32.826395 36.464780 7.301719
3: 56.7248030 1.8191509 0.4984765 20.115918 2.723107 12.474775 9.902897 66.611781 14.688207 35.174818
4: 11.6769863 2.7011073 1.8394440 12.722435 22.617819 71.628674 24.760610 87.478750 39.584213 53.905103
5: 20.9521768 4.0113985 10.4830118 15.872884 20.648315 3.732401 35.104546 6.887527 14.765046 49.887537
6: 43.0714926 15.4138724 3.4669138 12.356293 16.987197 65.821000 10.112033 79.346301 35.558503 10.093469
7: 56.7662495 0.6943967 0.5551267 7.003272 8.678146 13.821827 10.113054 57.142701 45.473646 68.849309
8: 28.8905951 16.9411108 7.9582688 22.717759 13.941883 85.543676 34.614880 4.894407 7.962877 42.016436
9: 0.7723185 9.1814918 7.1935292 19.602904 11.780468 8.605067 22.441414 98.035135 32.753133 60.088064
10: 15.0768755 8.8666260 11.0955099 13.974926 22.752673 3.622341 34.889611 91.453735 41.014587 69.684782
Finally the answer2 is given via the function
prod2<-function(m1,m2){
res<-NULL
if(dim(m1)[1]==dim(m2)[1])
res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(2:nrow(m2)),function(y) {lapply(seq_along(2:nrow(m1)[1]),function(x){m1[,x]*m2[y,x+1]})})))$V1))
return(res)
}
and in particular answer2<-prod2(trust,rating), yielding:
V1 V2 V3 V4 V5 V6 V7 V8 V9
1: 14.7130013 10.0866100 24.6056020 10.804906 23.46600 2.627076 77.48744 28.378331 71.23414
2: 3.7472596 0.2129277 15.0336387 9.145181 50.97318 3.215584 81.20639 15.428406 72.15371
3: 18.5822630 1.2446822 0.9586087 18.506645 11.36601 4.937932 26.06025 31.307537 29.06390
4: 3.8252197 1.8481260 3.5373923 11.704640 94.40481 28.353017 65.15950 41.115012 78.32621
5: 6.8636441 2.7446411 20.1596381 14.603053 86.18427 1.477409 92.38039 3.237138 29.21594
6: 14.1096269 10.5463338 6.6671419 11.367790 70.90308 26.054146 26.61061 37.292761 70.36044
7: 18.5958403 0.4751135 1.0675513 6.443011 36.22183 5.471140 26.61330 26.857069 89.97977
8: 9.4641605 11.5912864 15.3043631 20.900338 58.19221 33.861038 91.09179 2.300371 15.75633
9: 0.2530009 6.2820733 13.8337100 18.034672 49.17065 3.406172 59.05635 46.076514 64.80939
10: 4.9389764 6.0666389 21.3375191 12.856932 94.96768 1.433843 91.81476 42.983255 81.15652
Benchmarking
library(microbenchmark)
library("ggplot2")
set.seed(666)
global_func<-function(n){
trust<-matrix(runif(n*n),ncol=n,nrow=n)
rating<-matrix(sample(n*n),ncol=n,nrow=n)
prod1<-function(m1,m2){
res<-NULL
if(dim(m1)[1]==dim(m2)[1])
res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(1:nrow(m2)),function(y) {lapply(seq_along(1:nrow(m1)[1]),function(x){m1[,x]*m2[y,x]})})))$V1))
return(res)
}
prod2<-function(m1,m2){
res<-NULL
if(dim(m1)[1]==dim(m2)[1])
res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(2:nrow(m2)),function(y) {lapply(seq_along(2:nrow(m1)[1]),function(x){m1[,x]*m2[y,x+1]})})))$V1))
return(res)
}
return(list(prod1(trust,rating),prod2(trust,rating)))
}
Let's compare times vs number of cols/rows (n)---Use with caution
tm<-microbenchmark(global_func(10),
global_func(50),
global_func(100),
global_func(500),
times = 100
)
autoplot(tm)

Loop over matrix using n consecutive rows in R

I have a matrix that consists of two columns and a number (n) of rows, while each row represents a point with the coordinates x and y (the two columns).
This is what it looks like:
V1 V2
146 17
151 19
153 24
156 30
158 36
163 39
168 42
173 44
...
Now, I would like to use a subset of three consecutive points, starting from row 1, to do some fitting, save the values from this fit in another list, and then go on to the next three points, and the next three, and so on until the list is finished. Something like this:
Data_Fit_Kasa_1 <- CircleFitByKasa(Data[1:3,])
Data_Fit_Kasa_2 <- CircleFitByKasa(Data[4:6,])
....
Data_Fit_Kasa_n <- CircleFitByKasa(Data[i:(i+2),])
I have tried to construct a loop, but I can't make it work. R either tells me that there's an "unexpected '}' in "}"" or that the "subscript is out of bounds". This is what I've tried:
minimal runnable code
install.packages("conicfit")
library(conicfit)
CFKasa <- NULL
Data.Fit <- NULL
for (i in 1:length(Data)) {
row <- Data[i:(i+2),]
CFKasa <- CircleFitByKasa(row)
Data.Fit[i] <- CFKasa[3]
}
RStudio Version 0.99.902 – © 2009-2016 RStudio, Inc.; Win10 Edu.
The third element of the fitted circle (CFKasa[3]) represents the radius, which is what I am really interested in. I am really stuck here, please help.
Many thanks in advance!
Best, David
Turn your data into a 3D array and use apply:
DF <- read.table(text = "V1 V2
146 17
151 19
153 24
156 30
158 36
163 39", header = TRUE)
a <- t(DF)
dim(a) <-c(nrow(a), 3, ncol(a) / 3)
a <- aperm(a, c(2, 1, 3))
# , , 1
#
# [,1] [,2]
# [1,] 146 17
# [2,] 151 19
# [3,] 153 24
#
# , , 2
#
# [,1] [,2]
# [1,] 156 30
# [2,] 158 36
# [3,] 163 39
center <- function(m) c(mean(m[,1]), mean(m[,2]))
t(apply(a, 3, center))
# [,1] [,2]
#[1,] 150 20
#[2,] 159 35
center(DF[1:3,])
#[1] 150 20
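If a plain loop over row indices is preferred, the sketch below steps through the start row of each non-overlapping triple. fit_circle is a hypothetical base-R stand-in (an algebraic least-squares circle fit returning centre and radius) so that the example runs without extra packages; with conicfit loaded, CircleFitByKasa could presumably be used in its place, since the question states that its third element is the radius.

```r
# Hypothetical stand-in for the circle fit: algebraic least squares,
# returning c(centre_x, centre_y, radius)
fit_circle <- function(xy) {
  x <- xy[, 1]
  y <- xy[, 2]
  # Solve x^2 + y^2 + B*x + C*y + D = 0 for B, C, D in the least-squares sense
  p <- unname(qr.solve(cbind(x, y, 1), -(x^2 + y^2)))
  cx <- -p[1] / 2
  cy <- -p[2] / 2
  c(cx, cy, sqrt(cx^2 + cy^2 - p[3]))
}

Data <- as.matrix(read.table(text = "V1 V2
146 17
151 19
153 24
156 30
158 36
163 39", header = TRUE))

# Start rows of the non-overlapping triples: 1, 4, 7, ...
starts <- seq(1, nrow(Data) - 2, by = 3)

# Radius (third element) of the fit for each triple of rows
Data.Fit <- sapply(starts, function(i) fit_circle(Data[i:(i + 2), ])[3])
Data.Fit
```

Iterating over nrow(Data) rather than length(Data) matters here: for a matrix, length() counts every cell, so an index built from it runs past the last row. Using by = 1 instead of by = 3 would give a sliding window of overlapping triples.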

R: reshape data by chunks - more elegant way

I stumbled upon the following problem. I read the reshape manual, but I am still lost.
Is there a more efficient and elegant way to reshape a matrix made of even chunks?
The code to generate the matrix and the reshaped matrix is below.
# current matrix
x <- matrix(sample(20*9), 20, 9)
colnames(x) <- c(paste("time",c(1:3),sep="_"),
paste("SGNL", 1, c(1:3), sep="_"),
paste("SGNL", 2, c(1:3), sep="_"))
# reshaped matrix
x.reshaped <- rbind( x[,c(1,4,7)], x[,c(2,5,8)], x[,c(3,6,9)] )
colnames(x.reshaped) <- sub("\\_1$", "", colnames(x.reshaped))
Thanks!
If you want to use an approach that is name-based and not position-based, then you should look at melt from "data.table":
library(data.table)
melt(as.data.table(x), measure.vars = patterns("time", "SGNL_1", "SGNL_2"))
Example output:
head(melt(as.data.table(x), measure.vars = patterns("time", "SGNL_1", "SGNL_2")))
# variable value1 value2 value3
# 1: 1 48 110 155
# 2: 1 67 35 140
# 3: 1 102 55 72
# 4: 1 161 39 66
# 5: 1 36 137 99
# 6: 1 158 169 85
Or, in base R:
patts <- c("time", "SGNL_1", "SGNL_2")
sapply(patts, function(y) c(x[, grep(y, colnames(x))]))
# time SGNL_1 SGNL_2
# [1,] 48 110 155
# [2,] 67 35 140
# [3,] 102 55 72
# [4,] 161 39 66
# [5,] 36 137 99
# .
# .
# .
# .
# [56,] 13 1 84
# [57,] 40 46 95
# [58,] 152 7 178
# [59,] 81 79 123
# [60,] 50 101 146
Data generated with set.seed(1).
We could create the subset of matrices (based on the index generated by the seq) in a list and then rbind it together.
do.call(rbind, lapply(1:3, function(i) x[,seq(i, length.out=3, by=3)]))
Or using a for loop
m2 <- c()
for(i in 1:3) { m2 <- rbind(m2, x[,seq(i, length.out=3, by=3)])}
x[,c(matrix(1:9, 3, byrow=TRUE))] # or shorter:
x[,matrix(1:9, 3, byrow=TRUE)]
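To check candidates against the question's x.reshaped, a short comparison can help. The sketch below assumes the question's column layout (the three columns of each chunk sit next to each other) and uses an arbitrary seed. Under that assumption, refilling the data column-wise with matrix() appears to give the same values as the rbind approach, although it drops the column names.

```r
set.seed(1)  # arbitrary seed for reproducibility
x <- matrix(sample(20 * 9), 20, 9)
colnames(x) <- c(paste("time", 1:3, sep = "_"),
                 paste("SGNL", 1, 1:3, sep = "_"),
                 paste("SGNL", 2, 1:3, sep = "_"))

# Reference result from the question
x.reshaped <- rbind(x[, c(1, 4, 7)], x[, c(2, 5, 8)], x[, c(3, 6, 9)])

# List-and-rbind approach
via.rbind <- do.call(rbind, lapply(1:3, function(i) x[, seq(i, length.out = 3, by = 3)]))

# Column-wise refill into 3 columns (dimnames are dropped)
via.matrix <- matrix(x, ncol = 3)

identical(via.rbind, x.reshaped)
all(via.matrix == x.reshaped)
```

The matrix() shortcut relies entirely on column position, so it would silently give wrong results if the columns were ordered differently; the name-based approaches are safer in that case.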

R - Apply function with different argument value for each row/column of a matrix

I am trying to apply a function to each row or column of a matrix, but I need to pass a different argument value for each row.
I thought I was familiar with lapply, mapply etc... But probably not enough.
As a simple example :
> a<-matrix(1:100,ncol=10);
> a
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 11 21 31 41 51 61 71 81 91
[2,] 2 12 22 32 42 52 62 72 82 92
[3,] 3 13 23 33 43 53 63 73 83 93
[4,] 4 14 24 34 44 54 64 74 84 94
[5,] 5 15 25 35 45 55 65 75 85 95
[6,] 6 16 26 36 46 56 66 76 86 96
[7,] 7 17 27 37 47 57 67 77 87 97
[8,] 8 18 28 38 48 58 68 78 88 98
[9,] 9 19 29 39 49 59 69 79 89 99
[10,] 10 20 30 40 50 60 70 80 90 100
Let's say I want to apply a function to each row, I would do :
apply(a, 1, myFunction);
However my function takes an argument, so :
apply(a, 1, myFunction, myArgument);
But if I want my argument to take a different value for each row, I cannot find the right way to do it.
If I define a 'myArgument' with multiple values, the whole vector will obviously be passed to each call of 'myFunction'.
I think that I would need a kind of hybrid between apply and the multivariate mapply. Does that make sense?
One 'dirty' way to achieve my goal is to split the matrix by rows (or columns), use mapply on the resulting list and merge the result back to a matrix :
do.call(rbind, Map(myFunction, split(a,row(a)), as.list(myArgument)));
I had a look at sweep, aggregate, and all the *apply variations, but I couldn't find the perfect match for my need. Did I miss it?
Thank you for your help.
You can use sweep to do that.
a <- matrix(rnorm(100),10)
rmeans <- rowMeans(a)
a_new <- sweep(a,1,rmeans,`-`)
rowMeans(a_new)
I don't think there are any great answers, but you can somewhat simplify your solution by using mapply, which handles the "rbind" part for you, assuming your function always returns a vector of the same size (also, Map is really just mapply):
a <- matrix(1:80,ncol=8)
myFun <- function(x, y) (x - mean(x)) * y
myArg <- 1:nrow(a)
t(mapply(myFun, split(a, row(a)), myArg))
I know the topic is quite old, but I had the same issue and I solved it this way:
# Original matrix
a <- matrix(runif(n=100), ncol=5)
# Different value for each row
v <- runif(n=nrow(a))
# Result matrix -> Add a column with the row number
o <- cbind(1:nrow(a), a)
fun <- function(x, v) {
idx <- 2:length(x)
i <- x[1]
r <- x[idx] / v[i]
return(r)
}
o <- t(apply(o, 1, fun, v=v))
By adding a column with the row number to the left of the original matrix, the index of the needed value from the argument vector can be received from the first column of the data matrix.
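A related option, sketched here with the myFun and myArg names from the mapply answer above, is to iterate over the row indices directly, so the matrix does not need an extra index column. The names by.index and by.mapply are introduced only for this comparison.

```r
a <- matrix(1:80, ncol = 8)
myFun <- function(x, y) (x - mean(x)) * y
myArg <- seq_len(nrow(a))

# Iterate over row indices so each call sees its own row and its own argument
by.index <- t(sapply(seq_len(nrow(a)), function(i) myFun(a[i, ], myArg[i])))

# Same values as the split() + mapply() version (only the dimnames differ)
by.mapply <- t(mapply(myFun, split(a, row(a)), myArg))
all.equal(by.index, by.mapply, check.attributes = FALSE)
```

Like mapply, this relies on the function returning a vector of the same length for every row; otherwise sapply returns a list instead of a matrix.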
