R function: only the last value of a vector is printed

Thank you for your help with this function, which should:
Take a specific value as input
Insert that value into a formula
Take a set of other values generated as a vector
Calculate a result for each element of the vector
Return a data frame with both the vector and the calculated values.
Here is what I tried:
rate <- function(Y2) {
  ran <- seq(0.001, 1, 0.001)
  for (i in ran) {
    calculated <- as.vector(Y2/(1+i) + Y2/(1+i)^2 + Y2/(1+i)^3 + Y2/(1+i)^4)
    tableau <- data.frame(ran, calculated)
  }
  return(tableau)
}
When testing with res <- rate(500), only the last calculated value is returned, repeated 1000 times:
...
ran calculated
1 0.001 468.75
2 0.002 468.75
3 0.003 468.75
...
996 0.996 468.75
997 0.997 468.75
998 0.998 468.75
999 0.999 468.75
1000 1.000 468.75
What is wrong with my loop?

You're assigning the output of your as.vector(...) calculation to the same variable, calculated, on every pass through the loop, and then rebuilding the data.frame tableau from scratch each time, so only the last iteration survives. If you want to save each iteration, you need to index into something:
res[n] <- as.vector(...)
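For example, here is a minimal sketch of that fix, keeping the question's formula but pre-allocating a result vector and filling it by position (rate_loop is just an illustrative name):
rate_loop <- function(Y2) {
  ran <- seq(0.001, 1, 0.001)
  calculated <- numeric(length(ran))   # pre-allocate space for every iteration
  for (n in seq_along(ran)) {
    i <- ran[n]
    calculated[n] <- Y2/(1+i) + Y2/(1+i)^2 + Y2/(1+i)^3 + Y2/(1+i)^4
  }
  data.frame(ran, calculated)          # build the data frame once, after the loop
}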
Or the more R-ish version, use one of the apply family (specifically lapply) and no loop at all:
rate <- function(Y2) {
  ran <- seq(0.001, 1, 0.001)
  result <- lapply(ran, function(i) {
    data.frame(ran = i,
               calculated = as.vector(Y2/(1+i) + Y2/(1+i)^2 + Y2/(1+i)^3 + Y2/(1+i)^4))
  })
  return(do.call(rbind, result))
}
With that said, there is no reason for a loop or an apply function. Use the fact that R is vectorized:
ran <- seq(0.001, 1, 0.001)
Y2 <- 500
calculated <- as.vector(Y2/(1+ran)+Y2/(1+ran)^2+Y2/(1+ran)^3+Y2/(1+ran)^4)
result <- data.frame(ran, calculated)
all.equal(result, rate(500))
# [1] TRUE


Execution times of apply() functions? apply() vs. for loop? [duplicate]

It is often said that one should prefer lapply over for loops.
There are some exceptions, as for example Hadley Wickham points out in his Advanced R book (http://adv-r.had.co.nz/Functionals.html): modifying in place, recursion, etc.
The following is one such case.
Just for the sake of learning, I tried to rewrite a perceptron algorithm in functional form in order to benchmark the relative performance (source: https://rpubs.com/FaiHas/197581).
Here is the code.
# prepare input
data(iris)
irissubdf <- iris[1:100, c(1, 3, 5)]
names(irissubdf) <- c("sepal", "petal", "species")
head(irissubdf)
irissubdf$y <- 1
irissubdf[irissubdf[, 3] == "setosa", 4] <- -1
x <- irissubdf[, c(1, 2)]
y <- irissubdf[, 4]

# perceptron function with for
perceptron <- function(x, y, eta, niter) {
  # initialize weight vector
  weight <- rep(0, dim(x)[2] + 1)
  errors <- rep(0, niter)
  # loop over number of epochs niter
  for (jj in 1:niter) {
    # loop through training data set
    for (ii in 1:length(y)) {
      # Predict binary label using Heaviside activation function
      z <- sum(weight[2:length(weight)] * as.numeric(x[ii, ])) + weight[1]
      if (z < 0) {
        ypred <- -1
      } else {
        ypred <- 1
      }
      # Change weight - the formula doesn't do anything
      # if the predicted value is correct
      weightdiff <- eta * (y[ii] - ypred) * c(1, as.numeric(x[ii, ]))
      weight <- weight + weightdiff
      # Update error function
      if ((y[ii] - ypred) != 0) {
        errors[jj] <- errors[jj] + 1
      }
    }
  }
  # weight to decide between the two species
  return(errors)
}

err <- perceptron(x, y, 1, 10)
### my rewriting in functional form: auxiliary function
faux <- function(x, weight, y, eta) {
  err <- 0
  z <- sum(weight[2:length(weight)] * as.numeric(x)) + weight[1]
  if (z < 0) {
    ypred <- -1
  } else {
    ypred <- 1
  }
  # Change weight - the formula doesn't do anything
  # if the predicted value is correct
  weightdiff <- eta * (y - ypred) * c(1, as.numeric(x))
  weight <<- weight + weightdiff
  # Update error function
  if ((y - ypred) != 0) {
    err <- 1
  }
  err
}

weight <- rep(0, 3)
weightdiff <- rep(0, 3)

f <- function() {
  t <- replicate(10, sum(unlist(lapply(seq_along(irissubdf$y), function(i) {
    faux(irissubdf[i, 1:2], weight, irissubdf$y[i], 1)
  }))))
  weight <<- rep(0, 3)
  t
}
I did not expect any consistent improvement because of the aforementioned issues, but I was nevertheless really surprised by the sharp worsening with lapply and replicate.
I obtained these results using the microbenchmark function from the microbenchmark package.
What could possibly be the reasons? Could it be some memory leak?
expr                                                        min        lq          mean        median      uq          max         neval
f()                                                         48670.878  50600.7200  52767.6871  51746.2530  53541.2440  109715.673  100
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)    4184.131   4437.2990   4686.7506   4532.6655   4751.4795    6513.684  100
perceptronC(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)     95.793    104.2045    123.7735    116.6065    140.5545     264.858  100
The first function is the lapply/replicate version, the second is the function with for loops, and the third is the same function in C++ using Rcpp.
Here, as suggested by Roland, is the profiling of the function. I am not sure I can interpret it the right way; it looks to me like most of the time is spent in subsetting.
Function profiling
First of all, it is an already long-debunked myth that for loops are any slower than lapply. The for loops in R have been made a lot more performant and are currently at least as fast as lapply.
That said, you have to rethink your use of lapply here. Your implementation demands assigning to the global environment, because your code requires you to update the weight during the loop. And that is a valid reason not to consider lapply.
lapply is a function you should use for its return value, not for side effects. The function lapply combines the results in a list automatically and doesn't mess with the environment you work in, contrary to a for loop. The same goes for replicate. See also this question:
Is R's apply family more than syntactic sugar?
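A trivial illustration of that difference (not from the original posts, just to make the point concrete):
# lapply collects its results into a list and returns them; it does not
# touch the environment you call it from
squares_list <- lapply(1:3, function(i) i^2)    # list(1, 4, 9)

# a for loop returns nothing, so to keep results you must assign into an
# object that lives in the calling environment
squares_vec <- numeric(3)
for (i in 1:3) squares_vec[i] <- i^2            # c(1, 4, 9)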
The reason your lapply solution is far slower is that your way of using it creates a lot more overhead.
replicate is nothing else but sapply internally, so you actually combine sapply and lapply to implement your double loop. sapply creates extra overhead because it has to test whether or not the result can be simplified. So a for loop will actually be faster than using replicate.
Inside your lapply anonymous function, you have to access the data frame for both x and y for every observation. This means that, contrary to your for loop, e.g. the function $ has to be called every time.
Because you use these higher-level functions, your lapply solution calls 49 functions, compared to your for solution which only calls 26. These extra functions for the lapply solution include calls to functions like match, structure, [[, names, %in%, sys.call, duplicated, ...
All of these functions are not needed by your for loop, as that one doesn't do any of these checks.
If you want to see where this extra overhead comes from, look at the internal code of replicate, unlist, sapply and simplify2array.
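For instance, you can confirm at the console that replicate is a thin layer over sapply (and sapply over lapply plus a simplification step); the exact bodies depend on your R version:
# print the function bodies to see how the layers stack up
print(replicate)
print(sapply)
print(simplify2array)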
You can use the following code to get a better idea of where you lose performance with lapply. Run this line by line!
Rprof(interval = 0.0001)
f()
Rprof(NULL)
fprof <- summaryRprof()$by.self
Rprof(interval = 0.0001)
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)
Rprof(NULL)
perprof <- summaryRprof()$by.self
fprof$Fun <- rownames(fprof)
perprof$Fun <- rownames(perprof)
Selftime <- merge(fprof, perprof,
                  all = TRUE,
                  by = 'Fun',
                  suffixes = c(".lapply", ".for"))
sum(!is.na(Selftime$self.time.lapply))
sum(!is.na(Selftime$self.time.for))
Selftime[order(Selftime$self.time.lapply, decreasing = TRUE),
         c("Fun", "self.time.lapply", "self.time.for")]
Selftime[is.na(Selftime$self.time.for),]
There is more to the question of when to use for or lapply, and which one "performs" better. Sometimes speed is important, other times memory is. To further complicate things, the time complexity may not be what you expect - that is, different behavior can be observed at different scales, invalidating any blanket statement such as "faster than" or "at least as fast as". Finally, one performance metric that is often overlooked is thought-to-code time; premature optimization, yada yada.
That said, in the Introduction to R the authors hint at some performance concerns:
Warning: for() loops are used in R code much less often than in compiled languages. Code that takes a ‘whole object’ view is likely to be both clearer and faster in R.
Given a similar use case, input and output, disregarding user preferences, is one clearly better than the other?
Benchmark - Fibonacci sequence
I compare approaches to compute the first N Fibonacci numbers (inspired by the benchmarkme package), shunning the 2nd Circle and ensuring that inputs and outputs for each approach are the same. Four additional approaches are included to throw some oil on the fire - a vectorized approach, purrr::map, and the *apply variants vapply and sapply.
fib <- function(x, ...) {
  x <- 1:x ; phi = 1.6180339887498949 ; v = \() vector("integer", length(x))
  bench::mark(
    vector = {
      y = v(); y = ((rep(phi, length(x))^x) - ((-rep(phi, length(x)))^-x)) / sqrt(5); y},
    lapply = {
      y = v(); y = unlist(lapply(x, \(.) (phi^. - (-phi)^(-.)) / sqrt(5)), use.names = F); y},
    loop = {
      y = v(); `for`(i, x, {y[i] = (phi^i - (-phi)^(-i)) / sqrt(5)}); y},
    sapply = {
      y = v(); y = sapply(x, \(.) (phi^. - (-phi)^(-.)) / sqrt(5)); y},
    vapply = {
      y = v(); y = vapply(x, \(.) (phi^. - (-phi)^(-.)) / sqrt(5), 1); y},
    map = {
      y = v(); y <- purrr::map_dbl(x, ~ (phi^. - (-phi)^(-.)) / sqrt(5)); y
    }, ..., check = T
  )[c(1:9)]
}
Here is a comparison of the performance, ranked by median time.
lapply(list(3e2, 3e3, 3e4, 3e5, 3e6, 3e7), fib) # n iterations specified separately
N = 300
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
1 vector 38.8us 40.9us 21812. 8.44KB 0 1000 0 45.8ms
2 vapply 500us 545us 1653. 3.61KB 1.65 999 1 604ms
3 sapply 518us 556us 1725. 12.48KB 0 1000 0 580ms
4 lapply 513.4us 612.8us 1620. 6KB 8.14 995 5 614.2ms
5 loop 549.9us 633.6us 1455. 3.61KB 8.78 994 6 683.3ms
6 map 649.6us 754.6us 1312. 3.61KB 9.25 993 7 756.9ms
N = 3000
1 vector 769.7us 781.5us 1257. 82.3KB 1.26 999 1 794.83ms
2 vapply 5.38ms 5.58ms 173. 35.2KB 0.697 996 4 5.74s
3 sapply 5.59ms 5.83ms 166. 114.3KB 0.666 996 4 6.01s
4 loop 5.38ms 5.91ms 167. 35.2KB 8.78 950 50 5.69s
5 lapply 5.24ms 6.49ms 156. 58.7KB 8.73 947 53 6.07s
6 map 6.11ms 6.63ms 148. 35.2KB 9.13 942 58 6.35s
N = 30 000
1 vector 10.7ms 10.9ms 90.9 821KB 0.918 297 3 3.27s
2 vapply 57.3ms 60.1ms 16.4 351.66KB 0.741 287 13 17.5s
3 loop 59.2ms 60.7ms 15.9 352KB 16.7 146 154 9.21s
4 sapply 59.6ms 62.1ms 15.7 1.05MB 0.713 287 13 18.2s
5 lapply 57.3ms 67.6ms 15.1 586KB 20.5 127 173 8.43s
6 map 66.7ms 69.1ms 14.4 352KB 21.6 120 180 8.35s
N = 300 000
1 vector 190ms 193ms 5.14 8.01MB 0.206 100 4 19.45s
2 loop 693ms 713ms 1.40 3.43MB 7.43 100 532 1.19m
3 map 766ms 790ms 1.26 3.43MB 7.53 100 598 1.32m
4 vapply 633ms 814ms 1.33 3.43MB 0.851 100 39 45.8s
5 lapply 685ms 966ms 1.06 5.72MB 9.13 100 864 1.58m
6 sapply 694ms 813ms 1.27 12.01MB 0.810 100 39 48.1s
N = 3 000 000
1 vector 3.17s 3.21s 0.312 80.1MB 0.249 20 16 1.07m
2 vapply 8.22s 8.37s 0.118 34.3MB 4.97 20 845 2.83m
3 loop 8.3s 8.42s 0.119 34.3MB 4.35 20 733 2.81m
4 map 9.09s 9.17s 0.109 34.3MB 4.91 20 903 3.07m
5 lapply 10.42s 11.09s 0.0901 57.2MB 4.10 20 909 3.7m
6 sapply 10.43s 11.28s 0.0862 112.1MB 3.58 20 830 3.87m
N = 30 000 000
1 vector 44.8s 45.94s 0.0214 801MB 0.00854 10 4 7.8m
2 vapply 1.56m 1.6m 0.0104 343MB 0.883 10 850 16m
3 loop 1.56m 1.62m 0.00977 343MB 0.366 10 374 17.1m
4 map 1.72m 1.74m 0.00959 343MB 1.23 10 1279 17.4m
5 lapply 2.15m 2.22m 0.00748 572MB 0.422 10 565 22.3m
6 sapply 2.05m 2.25m 0.00747 1.03GB 0.405 10 542 22.3m
# Intel i5-8300H CPU @ 2.30GHz / R version 4.1.1 / purrr 0.3.4
for and lapply approaches perform similarly, but lapply is greedier when it comes to memory, and a bit slower when the size of the input increases (for this task). Note that purrr::map memory usage is equivalent to that of the for loop, and superior to that of lapply, in itself a debated topic. However, when the appropriate *apply* is used, here vapply, the performance is similar. Still, the choice can have a large impact on memory use, sapply being noticeably less memory efficient than vapply.
A peek under the hood reveals the reason for the different performance of the approaches. The for loop performs many type checks, resulting in some overhead. lapply, on the other hand, suffers from a flawed language design where lazy evaluation, or the use of promises, comes at a cost; the source code confirms that the X and FUN arguments to .Internal(lapply) are promises.
Vectorized approaches are fast, and probably desirable over a for or lapply approach. Notice how the vectorized approach's timing grows irregularly compared to the other approaches. However, the aesthetics of vectorized code may be a concern: which approach would you prefer to debug?
Overall, I'd say the choice between lapply and for is not something the average R user should ponder over. Stick to what is easiest to write, think about, and debug, or what is less (silently?) error prone. What is lost in performance will likely be canceled out by time saved writing. For performance-critical applications, make sure to run some tests with different input sizes and to properly chunk code.
Actually,
I did test the difference with a problem that I solved recently.
Just try it yourself.
My conclusion: there is no real difference, though in my case the for loop was insignificantly faster than lapply.
PS: I tried to mostly keep the same logic in both versions.
library(rbenchmark)  # provides benchmark()

ds2 <- data.frame(matrix(rnorm(1000000), ncol = 8))
n <- c('a','b','c','d','e','f','g','h')
names(ds2) <- n  # name the columns so target_col = 'a' and query_col = 'b' exist

func <- function(ds, target_col, query_col, value){
  return(unique(as.vector(ds[ds[query_col] == value, target_col])))
}

f1 <- function(x, y){
  named_list <- list()
  for (i in y){
    named_list[[i]] <- func(x, 'a', 'b', i)
  }
  return(named_list)
}

f2 <- function(x, y){
  list2 <- lapply(setNames(nm = y), func, ds = x, target_col = "a", query_col = "b")
  return(list2)
}

benchmark(f1(ds2, n))
benchmark(f2(ds2, n))
As you can see, it is a simple routine that builds a named list from a data frame: the func function extracts the column values, f1 uses a for loop to iterate over the query values, and f2 uses lapply.
On my computer I get these results:
        test replications elapsed relative user.self sys.self user.child sys.child
1 f1(ds2, n)          100  110.24        1   110.112        0          0         0

&&

        test replications elapsed relative user.self sys.self user.child sys.child
1 f1(ds2, n)          100  110.24        1   110.112        0          0         0

How to pass list elements to a model in R?

I am quite new to the use of lists, so I apologize if this problem sounds very dumb.
From an original set of 459,046 customers, I have created a function that splits and stores the base in several elements of a list.
sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
Executing this function (baseSample) you get a new list object containing mutually exclusive groups of customers (each group will be made of 10,000 customers, apart from the last one, which may be smaller depending on the initial volume):
> sampled_list <- baseSample(dataset = clv_df_cbs, sample.size = 10000, seed = 12345)
[1] "Seed: 12345"
[1] "Total groups created: 46"
[1] "Group size: 10000"
In this case, the output is a list of 46 elements stored in the object called sampled_list.
Now, I want to pass each of these 46 elements to a BTYD model that will forecast the number of transactions in the next 90 days (given the learnings from the input).
The reason why I cannot pass the full dataset to the BTYD model is that this model relies heavily on MCMC, so the calculation takes so long that the model never produces any output. I have therefore decided to generate forecasts by running the same model several times (on samples that are big enough) until the whole base has been used as model input.
The operations that need to be performed on each of the elements are the following:
# Estimate parameters for element1 of the list
pggg.draws1 <- pggg.mcmc.DrawParameters(element1,
                                        mcmc = 1000,  # number of MCMC steps
                                        burnin = 250, # number of initial MCMC steps which are discarded
                                        thin = 10,    # only every thin-th MCMC step will be returned
                                        chains = 2,   # number of MCMC chains to be run
                                        trace = 50)   # print logging step every trace iteration
# generate draws for holdout period
pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(element1, pggg.draws1)
# conditional expectations
element1$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
# P(active)
element1$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
# P(alive)
element1$palive.pggg <- mcmc.PAlive(pggg.draws1)
# show estimates for first few customers
head(element1[, c("x", "t.x", "x.star", "xstar.pggg", "pactive.pggg", "palive.pggg")], 50)
# report median cohort-level parameter estimates
round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
# report mean over median individual-level parameter estimates
median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
  apply(as.matrix(draw), 2, median)
})
round(apply(median.est1, 1, mean), 3)
Ideally, the output should be stored straight into a new data.frame, so I can retrieve the Id and the forecast (amongst other things originally included in the dataset).
Below is some mock data to play with, based on a publicly available dataset.
library(BTYDplus)
library(tidyverse)
data("groceryElog")
dataset<-elog2cbs(groceryElog, T.cal = "2006-12-01")
# FUNCTION baseSample ####
baseSample <- function(dataset, sample.size, seed = NULL) {
  seed.value <- if (is.null(seed)) {
    as.numeric(format(Sys.Date(), "%Y")) * 10000 +
      as.numeric(format(Sys.Date(), "%m")) * 100 +
      as.numeric(format(Sys.Date(), "%d"))
  } else {
    seed
  }
  set.seed(seed.value)
  # RE-ORDER DATA FRAME (SAME LENGTH)
  data <- with(dataset, dataset[order(sample(cust, nrow(dataset))), ])
  # BUILD A LIST OF DFs
  set.sample.size <- sample.size
  data$cycles_group <- paste0("sample_", ceiling(1:nrow(data) / set.sample.size))
  df_list <- split(data, data$cycles_group)
  print(paste0("Seed: ", seed.value))
  print(paste0("Total groups created: ", length(unique(data$cycles_group))))
  print(paste0("Group size: ", set.sample.size))
  return(df_list)
  #print(df_list)
}
# ** OUTPUT: Base split in lists ####
sampled_list <- baseSample(dataset = dataset, sample.size = 100, seed = 12345)
Thanks
In base R, you can use lapply to iterate a function over the elements of a list and return a new list with the results of those iterations. After using your example code to generate a list called sampled_list...
# turn the code for the operations you want to perform on each list element into a function,
# with a couple of minor tweaks
thingy <- function(i) {
  # Estimate parameters for element1 of the list
  pggg.draws1 <- pggg.mcmc.DrawParameters(i,
                                          mcmc = 1000,  # number of MCMC steps
                                          burnin = 250, # number of initial MCMC steps which are discarded
                                          thin = 10,    # only every thin-th MCMC step will be returned
                                          chains = 2,   # number of MCMC chains to be run
                                          trace = 50)   # print logging step every trace iteration
  # generate draws for holdout period
  pggg.xstar.draws1 <- mcmc.DrawFutureTransactions(i, pggg.draws1)
  # conditional expectations
  i$xstar.pggg <- apply(pggg.xstar.draws1, 2, mean)
  # P(active)
  i$pactive.pggg <- mcmc.PActive(pggg.xstar.draws1)
  # P(alive)
  i$palive.pggg <- mcmc.PAlive(pggg.draws1)
  # show estimates for first few customers [commenting out for this iterated version]
  # head(element1[, c("x", "t.x", "x.star", "xstar.pggg", "pactive.pggg", "palive.pggg")], 50)
  # report median cohort-level parameter estimates
  round(apply(as.matrix(pggg.draws1$level_2), 2, median), 3)
  # report mean over median individual-level parameter estimates
  median.est1 <- sapply(pggg.draws1$level_1, function(draw) {
    apply(as.matrix(draw), 2, median)
  })
  # get the bits you want in a named vector
  z <- round(apply(median.est1, 1, mean), 3)
  # convert that named vector of results into a one-row data frame to make collapsing easier
  data.frame(as.list(z))
}
# now use lapply to iterate that function over the elements of your list
results <- lapply(sampled_list, thingy)
# now bind the results into a data frame
boundresults <- do.call(rbind, results)
Results (which took a while to get):
k lambda mu tau z
sample_1 4.200 0.174 0.091 102.835 0.27
sample_10 3.117 0.149 0.214 128.143 0.29
sample_11 4.093 0.154 0.115 130.802 0.30
sample_12 4.191 0.142 0.053 114.108 0.33
sample_13 2.605 0.155 0.071 160.743 0.35
sample_14 9.196 0.210 0.084 111.747 0.36
sample_15 2.005 0.145 0.091 298.872 0.40
sample_16 2.454 0.111 0.019 78731750.121 0.70
sample_2 2.808 0.138 0.059 812.278 0.40
sample_3 4.327 0.166 0.116 559.318 0.42
sample_4 9.266 0.166 0.038 146.283 0.40
sample_5 3.277 0.157 0.073 105.915 0.33
sample_6 9.584 0.184 0.086 118.299 0.31
sample_7 4.244 0.189 0.118 54.945 0.23
sample_8 4.388 0.147 0.085 325.054 0.36
sample_9 7.898 0.181 0.052 83.892 0.33
You can also combine those last two steps into a single line of do.call(rbind, lapply(...)). If you want to make the row names in the results table into a column, you could do boundresults$sample <- row.names(boundresults) after making that table. And if you don't like creating new objects in your environment, you could put that function inside the call to lapply, i.e., lapply(sampled_list, function(i) { [your code] }).
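For example, those two variants spelled out, using the thingy and sampled_list objects defined above:
# bind in one step, then keep the sample id as an ordinary column
boundresults <- do.call(rbind, lapply(sampled_list, thingy))
boundresults$sample <- row.names(boundresults)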

Create multiplication sequence

I'm trying to create a list of numbers from 0.001 to 1000, where each value is the previous one multiplied by 100, so the list would be (0.001, 0.1, 10, ..., 1000).
Is there any function for it?
a <- 0.0001
b <- 0.0001
for (i in 1:5) {
  b <- b * 100
  print(b)
  a <- c(a, b)
}
Can we replace the loop here with something more simple?
It should be as simple as:
0.001 * 10^(seq(0,6,2))
# > 0.001 * 10^(seq(0,6,2))
# [1] 0.001 0.100 10.000 1000.000
Alternatively, you can use the GeometricSequence function from the bsts package in the following way:
GeometricSequence(4, initial.value = 0.001, discount.factor = 100)
The first argument, length (4 in this example), is a positive integer giving the length of the desired sequence. initial.value is the first term in the sequence, and discount.factor is the ratio between a sequence term and the preceding term.
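For reference, a geometric sequence is just initial.value * discount.factor^(0:(length - 1)), so the same values can also be produced in base R:
0.001 * 100^(0:3)
# [1]    0.001    0.100   10.000 1000.000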

Selecting significant cases from a chi-squared test

I tried the loop function that was given in this question and it seems to work. However, I still have two problems. First, I have 4753 comparisons, but R only lists those from 1946 to 4752. Is there a way to get the previous 1945 cases? I have already increased the length of my console to 100,000 lines, but that does not seem to help.
1946 1946 focushumrights pillar4info 0.867 1 0.352
1947 1947 focushumrights pillar4campagne 0.053 1 0.818
...
4752 4752 improveorglearning improvenetwork 49.064 9 0.000
4753 4753 improvetechexpert improvenetwork 43.738 9 0.000
Second, I get 4753 results and of those, only a few are significant. Is there a way to automatically filter out the significant cases based on a p-value smaller than 0.1 or 0.05?
You are confusing what is being displayed with what is being stored. I presume you are using the answer from the question you reference in your own question. That answer is a function that returns a data frame. You should store the data frame and then select rows as needed. For example,
## Example function that returns a data frame
f <- function(N = 1000) {
  out <- data.frame("Row" = 1:N,
                    "Column" = 1:N,
                    "Chi.Square" = runif(N),
                    "df" = sample(1:10, N, replace = TRUE),
                    "p.value" = round(runif(N), 3))
  return(out)
}
## Would just print everything to the screen
f()

## Store in a data frame
results <- f()

## Select rows as needed
results[results$p.value < 0.05, ]
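An equivalent filter using base R's subset(), here with the 0.1 cutoff also mentioned in the question:
## Same idea with subset(): keep rows whose p.value is below 0.1
subset(results, p.value < 0.1)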

Iterating over a big matrix containing 3000 rows and calculating the correlation - follow-up!

Thanks Nico! I almost got there after I corrected some small bugs. Here I attach my script:
datamatrix <- read.table("ref.txt", sep = "\t", header = T, row.names = 1)
correl <- NULL
for (i in 1:nrow(datamatrix)) {
  correl <- apply(datamatrix, 1, function(x) { cor(t(datamatrix[, i])) })
  write.table(correl, paste(row.names(datamatrix)[i], ".txt", sep = ""))
}
But I am afraid the function(x) part is the problem: it seems to amount to t(datamatrix[i, j]), which would calculate the correlation of any two rows.
Actually I need to iterate through the matrix: first cor(row01, row02) gives the correlation between row01 and row02; then cor(row01, row03) gives the correlation between row01 and row03, and so on until the correlation between row01 and row30000. That gives me the first column, for row01 (Row01 1.000, Row02 0.012, Row03 0.023, Row04 0.820, Row05 0.165, Row06 0.230, Row07 0.376, Row08 0.870), which I save to the file row01.txt.
Similarly I get the column for row02 (Row01 0.012, Row02 1.000, Row03 0.023, Row04 0.820, Row05 0.165, Row06 0.230, Row07 0.376, Row08 0.870) and save it to the file row02.txt.
In total I will get 30000 files. It is stupid, but this way I can skip the memory limit and easily handle the correlations for a specific row.
First, your code is erroneous: the correl step should be
correl <- apply(datamatrix, 1, function(x) { cor(x, datamatrix[i, ]) })
Then, you had better close the connections explicitly, otherwise R may leave too many connections open.
Finally, using write.table doesn't guarantee that you can retrieve the data easily; you have to construct the table yourself. Try this code:
correl <- NULL
for (i in 1:nrow(datamatrix)) {
  correl <- apply(datamatrix, 1, function(x) { cor(x, datamatrix[i, ]) })
  ff <- file(paste(row.names(datamatrix)[i], ".txt", sep = ""), open = "at")
  write(paste("row", "cor"), ff)
  tmp <- paste(names(correl), correl)
  write(tmp, ff, sep = "\n")
  close(ff)  # close this file's connection before moving to the next row
}
