I am trying to use the interp1 function in R for linearly interpolating a matrix without using a for loop. So far I have tried:
bthD <- c(0,2,3,4,5) # original depth vector
bthA <- c(4000,3500,3200,3000,2800) # original array of area
Temp <- c(4.5,4.2,4.2,4,5,5,4.5,4.2,4.2,4)
Temp <- matrix(Temp,2) # matrix for temperature measurements
# -- interpolating bathymetry data --
depthTemp <- c(0.5,1,2,3,4)
layerZ <- seq(depthTemp[1],depthTemp[5],0.1)
library(signal)
layerA <- interp1(bthD,bthA,layerZ);
# -- interpolate matrix --
layerT <- list()
for (i in 1:2){
t <- Temp[i,]
layerT[[i]] <- interp1(depthTemp,t,layerZ)
}
layerT <- do.call(rbind,layerT)
So, here I have used interp1 on each row of the matrix in a for loop. I would like to know how I could do this without using a for loop. I can do this in matlab by transposing the matrix as follows:
layerT = interp1(depthTemp,Temp',layerZ)'; % matlab code
but when I attempt to do this in R
layerT <- interp1(depthTemp,t(Temp),layerZ)
it does not return a matrix of interpolated results, but a numeric array. How can I ensure that R returns a matrix of the interpolated values?
There is nothing wrong with your approach, although I would probably avoid the intermediate t <- assignment.
If you want to feel R-ish, try
apply(Temp,1,function(t) interp1(depthTemp,t,layerZ))
You may have to wrap the whole call in t() (transpose) if you really need it that way, since apply returns one column per row of Temp.
Since this is a 3d-field, per-row interpolation might not be optimal. My favorite is interp.loess in package tgp, but for regular spacings other options might be available. That method does not work for your mini-example (which is fine for the question), since it requires a larger grid.
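To make the apply route concrete, here is a minimal sketch. It swaps in base R's approx() for signal::interp1 (an assumption made only so the snippet runs without extra packages) and reuses the data from the question.

```r
# Data from the question
depthTemp <- c(0.5, 1, 2, 3, 4)
Temp <- matrix(c(4.5, 4.2, 4.2, 4, 5, 5, 4.5, 4.2, 4.2, 4), 2)
layerZ <- seq(depthTemp[1], depthTemp[5], 0.1)

# Interpolate each row of Temp onto layerZ.
# approx() is a base-R stand-in for signal::interp1 (linear interpolation).
# apply() returns one column per row of Temp, so transpose to get rows back.
layerT <- t(apply(Temp, 1, function(row) approx(depthTemp, row, xout = layerZ)$y))

dim(layerT)
```

The result has one row per row of Temp and one column per element of layerZ, which matches the shape the for loop plus rbind produces.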
I am trying to calculate network indexes (clustering, modularity, edge density, degree, centrality etc) from 1000 simulated null matrices using the igraph package in R. The data I'm using is a mixed-species bird flock data that I've used to generate the null matrices.
Here's the code:
## Construct null matrices ##
library(EcoSimR)
library(igraph)
# creating a 1000 empty matrices
fl_emp <- lapply(1:1000, function(i) data.frame())
# simulating 1000 matrices by randomization
fl_wp_n <- replicate(1000, sim5(fl_wp[,3:ncol(fl_wp)]),simplify = FALSE) #fl_wp is the raw data
#sim5 function is from the package 'EcoSimR'
for(i in 1:length(fl_emp))
{
fl_wp_ig <- graph_from_incidence_matrix(fl_wp_n[[i]]) #Creating new igraph object to convert the null matrices to igraph objects to calculate network indexes
fl_wp_cw <- cluster_walktrap(fl_wp_ig[[i]])
fl_wp_mod <- modularity(fl_wp_cw[[i]]) ##Network index, this does not work
}
Here's what the simulated matrices look like(fl_wp_n) :
[1]: https://i.stack.imgur.com/1Q0Na.png
It is basically a list of 1000 elements, where each element is a simulated 133x74 matrix where the rows represent flock ID and the columns represent Species ID.
This is the error I'm getting when I run the loop:
> for(i in 1:length(fl_emp))
+ {
+ fl_wp_ig <- graph_from_incidence_matrix(fl_wp_n[[i]])
+ fl_wp_cw <- cluster_walktrap(fl_wp_ig[[i]])
+ fl_wp_mod <- modularity(fl_wp_cw[[i]])
+ }
Error in cluster_walktrap(fl_wp_ig[[i]]) : Not a graph object!
It seems to be not recognizing fl_wp_ig as an igraph object. Any idea why?
Is there a better way to calculate indices for 1000 matrices in one loop?
Sorry if this is a dumb question, I'm new to igraph and R in general
Thanks a lot in advance!
If you have a look at the documentation for cluster_walktrap, you will see that the function expects a graph object. As @Szabolcs pointed out, when you index fl_wp_ig[[i]] in the for loop, you are returning the vertices adjacent to vertex i, not the graph itself. You should only index fl_wp_n[[i]], because that is the list you want to take one matrix from on each iteration; the other variables are single objects.
So you could try:
list_outputs = list()
for(i in 1:length(fl_emp))
{
# fl_wp_n[[i]] gets 1 matrix each iteration. Output -> graph object
fl_wp_ig <- graph_from_incidence_matrix(fl_wp_n[[i]])
# Use the whole graph object fl_wp_ig
fl_wp_cw <- cluster_walktrap(fl_wp_ig)
# Use the whole fl_wp_cw output
fl_wp_mod <- modularity(fl_wp_cw)
# NOTE: the original loop did not keep the result of each iteration;
# it overwrote fl_wp_mod every time.
# Here an empty list is created before the for-loop and filled as we go.
list_outputs = append(list_outputs, fl_wp_mod)
}
Also, if you find it difficult to see the whole picture, you could try to create a custom function and use apply methods instead of a for-loop.
# Custom function
cluster_modularity = function(graph_object){
# takes only one matrix at a time (despite the argument name)
fl_wp_ig <- graph_from_incidence_matrix(graph_object)
fl_wp_cw <- cluster_walktrap(fl_wp_ig)
# return the modularity visibly, rather than ending on an assignment
modularity(fl_wp_cw)
}
# Iterate using lapply to store the outputs in a list - for example
list_outputs = lapply(fl_wp_n, cluster_modularity)
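If a plain numeric vector is more convenient than a list, the same pattern works with vapply. The sketch below is only an illustration: the random binary matrices stand in for the simulated null matrices, and fill_fraction is a hypothetical placeholder for the real network index, so that the snippet runs without igraph or EcoSimR.

```r
set.seed(1)

# Hypothetical stand-in for fl_wp_n: a list of small random 0/1 matrices
null_mats <- replicate(5, matrix(rbinom(12, 1, 0.5), nrow = 3), simplify = FALSE)

# Hypothetical index: fraction of filled cells (placeholder for modularity etc.)
fill_fraction <- function(m) mean(m > 0)

# vapply calls the function once per list element and checks that each
# result is a single number, returning a numeric vector instead of a list
index_values <- vapply(null_mats, fill_fraction, numeric(1))
index_values
```

The function receives one whole matrix per call, so no [[i]] indexing is needed inside it.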
I have a big dataset with 30000 rasters. My goal is to extract a mean value within a polygon that lies inside each raster, and to create a file with the extracted raster values and the dates taken from the raster filenames.
I succeeded in doing this by performing the following loop:
for (i in 1:length(rasters2014)){
a <- raster(rasters2014[i])
ext[i] <- as.vector(extract(a, poligon2, fun=mean, na.rm=TRUE, df=F))
}
output2 = data.frame(ext, filename=filename2014)
The problem is that the loop above takes about 2.5 hours to complete. Does anyone have an idea how I could speed up this process?
If your rasters are all properly aligned (same ncol, nrow, extent, origin, resolution), you could try identifying the "cell numbers" to be extracted by looking at the first file, then extracting based on those. This could speed up the processing because raster does not need to recompute which cells to extract for every file. Something like this:
rast1 <- raster(rasters2014[1])
cells <- extract(rast1, poligon2, cellnumbers = TRUE, df = TRUE)[,"cells"]
ext <- list()
for (i in 1:length(rasters2014)){
a <- raster(rasters2014[i])
# extracting by cell number returns the raw cell values, so take the mean explicitly
ext[[i]] <- mean(extract(a, cells), na.rm = TRUE)
}
Note that I am also using a list to store the results to avoid "growing" a vector, which is usually wasteful.
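The idea of computing the cell indices once and reusing them can be sketched in base R. Everything here is a stand-in: the "rasters" are plain matrices on a shared grid and the cells vector is a hypothetical polygon footprint, so the snippet does not depend on the raster package or on any files.

```r
set.seed(7)

# Stand-in data: three aligned "rasters" as plain 4 x 5 matrices (an assumption)
layers <- replicate(3, matrix(runif(20), nrow = 4), simplify = FALSE)

# Hypothetical polygon footprint, worked out once as cell indices
cells <- c(6, 7, 10, 11)

# Preallocate the result, then reuse the same indices for every layer
ext <- numeric(length(layers))
for (i in seq_along(layers)) {
  ext[i] <- mean(layers[[i]][cells], na.rm = TRUE)
}
ext
```

Because there is exactly one mean per layer, a preallocated numeric vector is used here; a list, as in the answer, serves the same purpose of not growing the result inside the loop.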
Alternatively, as suggested by @qdread, you could build a rasterStack using raster::stack(rasters2014, quick = TRUE) and call extract over the stack to avoid the for loop. I don't know which would be faster.
HTH
If your polygons do not overlap (and in most cases they don't) an alternative route is
library(raster)
x <- rasterize(poligon2, raster(rasters2014[1])) # rasterize needs a Raster object, not a filename
s <- raster::stack(rasters2014, quick = TRUE)
z <- zonal(s, x, "mean")
PS: Faster is nicer, but I would suggest getting lunch while this runs.
Thanks for your help! I've tried all of the proposed solutions and the computation time is generally the same regardless of the method applied. Therefore, I guess that it is just not possible to significantly speed up the computation.
I'm trying to understand the answer to this question using R and I'm struggling a lot.
The dataset for the R code can be found with this code
library(devtools)
install_github("genomicsclass/GSE5859Subset")
library(GSE5859Subset)
data(GSE5859Subset) ##this loads the three tables you need
Here is the question
Write a function that takes a vector of values e and a binary vector group coding two groups, and returns the p-value from a t-test: t.test( e[group==1], e[group==0])$p.value.
Now define g to code cases (1) and controls (0) like this g <- factor(sampleInfo$group)
Next use the function apply to run a t-test for each row of geneExpression and obtain the p-value. What is smallest p-value among all these t-tests?
The answer provided is
myttest <- function(e,group){
x <- e[group==1]
y <- e[group==0]
return( t.test(x,y)$p.value )
}
g <- factor(sampleInfo$group)
pvals <- apply(geneExpression,1,myttest, group=g)
min( pvals )
Which gives you the answer of 1.406803e-21.
What exactly is the input of the "e" argument of the myttest function when you run this? Is it possible to write this function as a formula like
t.test(DV ~ sampleInfo$group)
The t test is comparing the gene expression values of the 24 people (the values of which I believe are in the "geneExpression" matrix) by the group they were in, which you can find in sampleInfo's "group" column. I've run t tests so many times in R, but for some reason I can't wrap my mind around what's going on in this code.
Your question seems to be about understanding the function apply().
For the technical description, see ?apply.
My quick explanation: the apply() line of code in your question applies the following function to each of the rows of geneExpression
myttest(e=x, group=g)
where x is a placeholder for each row.
To help make sense of it, a for loop version of that apply() line would look something like:
N <- nrow(geneExpression) #so we don't have to type this twice
pvals <- numeric(N) #empty vector to store results
# what 'apply' does (with less typing from us)
for(i in 1:N) {
pvals[i] <- myttest(geneExpression[i,], group=g) #the whole group vector is passed for every row
}
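To see the equivalence on something small, here is a minimal sketch using simulated data in place of geneExpression and sampleInfo$group (the matrix expr and its size are assumptions for illustration). It also checks the formula interface asked about in the question on a single row.

```r
set.seed(42)

# Simulated stand-in: 6 "genes" (rows) x 10 "samples" (columns)
expr <- matrix(rnorm(6 * 10), nrow = 6)
g <- factor(rep(c(0, 1), each = 5))   # 5 controls (0), then 5 cases (1)

myttest <- function(e, group) {
  t.test(e[group == 1], e[group == 0])$p.value
}

# apply() hands each row of expr to myttest as the argument e
p_apply <- apply(expr, 1, myttest, group = g)

# The same thing written as an explicit loop
p_loop <- numeric(nrow(expr))
for (i in seq_len(nrow(expr))) {
  p_loop[i] <- myttest(expr[i, ], group = g)
}

# Formula interface for one row: the two-sided p-value is the same
p_formula <- t.test(expr[1, ] ~ g)$p.value

all.equal(p_apply, p_loop)
all.equal(p_apply[1], p_formula)
```

So e is simply one row of the matrix at a time: a numeric vector with one expression value per sample, in the same order as the group labels.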
I am normally a maple user currently working with R, and I have a problem with correctly indexing variables.
Say I want to define 2 vectors, v1 and v2, and I want to call the nth element in v1. In maple this is easily done:
v[1]:=some vector,
and the nth element is then called by the command
v[1][n].
How can this be done in R? The actual problem is as follows:
I have a sequence M (say of length 10, indexed by k) of simulated negbin variables. For each of these simulated variables I want to construct a vector X of length M[k] with entries given by some formula. So I should end up with 10 different vectors, each of different length. My incorrect code looks like this
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
for(k in 1:sims){
x[k]<-rep(NA,M[k])
X[k]<-rep(NA,M[k])
for(i in 1:M[k]){x[k][i]<-runif(1,min=0,max=1)
if(x[k][i]>=0 & x[i]<=0.1056379){
X[k][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[k][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
The error appears to be that x[k] is not a valid name for a variable. Any way to make this work?
Thanks a lot :)
I've edited your R script slightly to get it working and make it reproducible. To do this I had to assume that eks_2016_kasko was an integer value of 10.
require(MASS)
sims<-10
# R is not zero-indexed and rnegbin can return 0, so add one to keep every M[k] >= 1
M<-rnegbin(sims, 10*exp(-2.17173), 840.1746) + 1
# Create a list
x <- list()
X <- list()
for(k in 1:sims){
x[[k]]<-rep(NA,M[k])
X[[k]]<-rep(NA,M[k])
for(i in 1:M[k]){
x[[k]][i]<-runif(1,min=0,max=1)
if(x[[k]][i]>=0 & x[[k]][i]<=0.1056379){
X[[k]][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[[k]][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
This will work and I think is what you were trying to do, BUT is not great R code. I strongly recommend using the lapply family instead of for loops, learning to use data.table and parallelisation if you need to get things to scale. Additionally if you want to read more about indexing in R and subsetting Hadley Wickham has a comprehensive break down here.
Hope this helps!
Let me start with a few remarks and then show you how your problem can be solved using R.
In R, there is usually no need to use a for loop in order to assign several values to a vector. So, for example, to fill a vector of length 100 with uniformly distributed random variables, you could do something like:
set.seed(1234)
x1 <- rep(NA, 100)
for (i in 1:100) {
x1[i] <- runif(1, 0, 1)
}
(set.seed() is used to set the random seed, such that you get the same result each time.) It is much simpler (and also much faster) to do this instead, resetting the seed first so that both versions draw the same numbers:
set.seed(1234)
x2 <- runif(100, 0, 1)
identical(x1, x2)
## [1] TRUE
As you see, results are identical.
The reason that x[k]<-rep(NA,M[k]) does not work is that indeed x[k] is not a valid variable name in R. [ is used for indexing, so x[k] extracts the element k from a vector x. Since you try to assign a vector of length larger than 1 to a single element, R complains: you get an error if x does not exist yet, and otherwise a warning that only the first value is used. What you probably want to use is a list, as you will see in the example below.
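As a quick illustration of the list idea, independent of the simulation itself, the Maple-style v[1][n] translates to v[[1]][n] in R. The values below are made up.

```r
# A list can hold vectors of different lengths
v <- list()
v[[1]] <- c(10, 20, 30)
v[[2]] <- c(1, 2)

v[[1]][2]    # second element of the first vector: 20
lengths(v)   # length of each stored vector: 3 2
```

Double brackets pick out one element of the list (here a whole vector), and the single bracket that follows indexes inside that vector.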
So here comes the code that I would use instead of what you proposed in your post. Note that I am not sure that I correctly understood what you intend to do, so I will also describe below what the code does. Let me know if this fits your intentions.
# define M
library(MASS)
eks_2016_kasko <- 486689.1
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
# define the function that calculates X for a single value from M
calculate_X <- function(m) {
x <- runif(m, min=0,max=1)
# x <= 0.1056379 selects the first distribution, matching the if/else in the question
X <- ifelse(x <= 0.1056379, rlnorm(m, 6.228244, 0.3565041),
rlnorm(m, 8.910837, 1.1890874))
}
# apply that function to each element of M
X <- lapply(M, calculate_X)
As you can see, there are no loops in that solution. I'll start to explain at the end:
lapply is used to apply a function (calculate_X) to each element of a list or vector (here it is the vector M). It returns a list. So, you can get, e.g. the third of the vectors with X[[3]] (note that [[ is used to extract elements from a list). And the contents of X[[3]] will be the result of calculate_X(M[3]).
The function calculate_X() does the following: It creates a vector of m uniformly distributed random values (remember that m runs over the elements of M) and stores that in x. Then it creates a vector X that contains log normally distributed random variables. The parameters of the distribution depend on the value x.
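A small check of the shape of the result may help. In this sketch rpois() is used as a stand-in for MASS::rnegbin (an assumption, so the snippet needs no extra packages), and the threshold follows the if/else from the question.

```r
set.seed(1)

# Stand-in for the negative binomial counts: 10 simulated lengths
M <- rpois(10, 5)

calculate_X <- function(m) {
  x <- runif(m, min = 0, max = 1)
  # ifelse picks, element by element, from one of the two lognormal draws
  ifelse(x <= 0.1056379,
         rlnorm(m, 6.228244, 0.3565041),
         rlnorm(m, 8.910837, 1.1890874))
}

X <- lapply(M, calculate_X)

length(X)      # one list element per entry of M
lengths(X)     # element k has length M[k]
```

Each X[[k]] is therefore a vector of M[k] simulated values, which is the "10 different vectors, each of different length" from the question.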
I am still quite new to R (I used to program in Matlab) and I am trying to use the parallel package to speed up some calculations. Below is an example in which I am trying to calculate the rolling standard deviation of a matrix (by column) with the zoo package, with and without parallelising the code. However, the shapes of the outputs came out different.
# load library
library('zoo')
library('parallel')
library('snow')
# Data
z <- matrix(runif(100000,0,1),100,1000) # 100 x 1000 = 100000 values
#This is what I want to calculate with timing
system.time(zz <- rollapply(z,10,sd,by.column=T, fill=NA))
# Trying to achieve the same output with parallel computing
cl<-makeSOCKcluster(4)
clusterEvalQ(cl, library(zoo))
system.time(yy <-parCapply(cl,z,function(x) rollapplyr(x,10,sd,fill=NA)))
stopCluster(cl)
My first output zz has the same dimensions as input z, whereas output yy is a vector rather than a matrix. I understand that I can do something like matrix(yy,nrow(z),ncol(z)) however I would like to know if I have done something wrong or if there is a better way of coding to improve this. Thank you.
From the documentation:
parRapply and parCapply always return a vector. If FUN always returns
a scalar result this will be of length the number of rows or columns:
otherwise it will be the concatenation of the returned values.
And:
parRapply and parCapply are parallel row and column apply functions
for a matrix x; they may be slightly more efficient than parApply but
do less post-processing of the result.
So, I'd suggest you use parApply.
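A minimal sketch of that suggestion follows. To keep it self-contained it uses only the parallel package, a smaller matrix, and a hypothetical base-R helper roll_sd in place of zoo::rollapplyr; with zoo loaded on the workers, the anonymous function from the question should slot in the same way.

```r
library(parallel)

# Base-R rolling standard deviation, right-aligned and padded with NA
# (a hypothetical helper standing in for zoo::rollapplyr)
roll_sd <- function(x, width) {
  out <- rep(NA_real_, length(x))
  for (i in seq(width, length(x))) {
    out[i] <- sd(x[(i - width + 1):i])
  }
  out
}

z <- matrix(runif(100 * 20), 100, 20)

cl <- makeCluster(2)
# MARGIN = 2 applies the function to each column; unlike parCapply,
# parApply reassembles the per-column results into a matrix
yy <- parApply(cl, z, 2, roll_sd, width = 10)
stopCluster(cl)

dim(yy)
```

The output has the same dimensions as z, so no manual matrix(yy, nrow(z), ncol(z)) step is needed.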