append new row for each run of function - r

I am trying to append a new row to a matrix each time I run a function. The idea is that the first time the function is run, a matrix is created; on succeeding runs, a new row of values is appended.
Here is some dummy data. Let's say x and y are the sides of a rectangle and z is some sort of ID. In reality these are not known in advance but are returned by the function. The real function takes a species directory as its argument, reads shapefiles, merges polygons and does a bunch of other things, but outputs a surface area. For each species (i.e. each run of the function) I would like to store the outputted area in a matrix or a data.frame for further analysis, instead of outputting it to individual variables.
myfunc <- function(x, y, z){
  area <- x*y
  id <- z
  tmp <- cbind(area, id)
  assign(as.matrix('mtrx'), rbind(tmp), envir=.GlobalEnv)
}
The above obviously only creates the matrix and overwrites it each time the function is run.
Any pointers would be very much appreciated!

If, as in your example, you know the values for x, y and z in advance, it makes sense to say something like:
> f1 <- function(x, y, z) c(x*y, z)
> mapply(f1, x=seq(4), y=seq(4), z=seq(4))
     [,1] [,2] [,3] [,4]
[1,]    1    4    9   16
[2,]    1    2    3    4
If the values for these variables are returned by another function, then perhaps best to store them until you're ready to run all the values through the final function (e.g. f1 above).
You say

a new row with values is appended

but in RAM a new matrix is created (assigned), with the new row added, each time you append. (You're in Circle 2 of The R Inferno: growing objects.)
For small data this is unlikely to be a problem in practice.
Also, using assign can make scoping awkward when the function is called within another environment (e.g. another function), so it is generally best avoided. There's usually a better alternative.
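For instance, a minimal sketch of preallocation (assuming here that the number of runs, say one per species, is known up front): create the matrix once at its final size and fill one row per run, so nothing is copied on each append:
n.runs <- 10  # e.g. one run per species; assumed known in advance
mtrx <- matrix(NA_real_, nrow = n.runs, ncol = 2,
               dimnames = list(NULL, c("area", "id")))
for (i in seq_len(n.runs)) {
  mtrx[i, ] <- c(i * 2, i)  # stand-in values for the function's area and id
}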

Here's the basic idea.
myfunc <- function(ID) {
  # do a bunch of stuff based on ID
  # calculate area
  area <- 2*ID + rnorm(1, 0, 10)  # fake the area...
  return(c(ID = ID, area = area))
}
ID.list <- 1:100  # vector of IDs
result <- do.call(rbind, lapply(ID.list, myfunc))
# head(result)
# ID area
# [1,] 1 -14.794850
# [2,] 2 13.777036
# [3,] 3 17.807578
# [4,] 4 21.070712
# [5,] 5 11.904047
# [6,] 6 3.735771
Return ID and area as a named vector with c(ID=ID, area=area). Do this for all IDs with the call to lapply(...), then bind the results together row by row using do.call(rbind, ...).

I highly recommend against this method, but to make your version work you need get in that last line:
assign('mtrx', rbind(get('mtrx', envir=parent.frame()), tmp), envir=.GlobalEnv)
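For completeness, a sketch of how the whole function might then look; the exists() guard is my addition (not in the original question) to handle the first run, when 'mtrx' does not yet exist:
myfunc <- function(x, y, z) {
  tmp <- cbind(area = x * y, id = z)
  if (exists('mtrx', envir = .GlobalEnv)) {
    # later runs: append a row to the existing matrix
    assign('mtrx', rbind(get('mtrx', envir = .GlobalEnv), tmp), envir = .GlobalEnv)
  } else {
    # first run: create the matrix
    assign('mtrx', tmp, envir = .GlobalEnv)
  }
}
Again, the do.call(rbind, lapply(...)) pattern above is the cleaner way to do this.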

Related

Using a loop to extract values of lists within lists in R

I have a list (chla_vals) containing 12 monthly values of chlorophyll for two lakes (so 24 total chla values). It's important to note that when I use length() to check the length of this list, the result is 12. It's because of this that I believe this is a list of lists.
Some helpful folks on here had a vaguely similar situation, so I adapted their code to extract the chla values for the first lake using:
chl_lake1 <- sapply(chla_vals, '[', 1)
Although this works great, I have multiple lists similar to chla_vals that contain slightly different values based on the method used to measure chlorophyll-a, and in the future I will have more lakes. I am therefore trying to write a loop that will extract these chl-a values for each month and each lake and output them into a dataframe.
I figured the best way to do this was to combine all my lists (e.g. chla_vals, chla_vals2, chla_vals3) into another list and loop over it. I am a beginner in R, so if this is not best practice please let me know.
My code so far:
#Reproducible examples of chla_vals, chla_vals2, and chla_vals3
chla_vals <- list(runif(2))
chla_vals <- rep(chla_vals, 12)
chla_vals2 <- list(runif(2))
chla_vals2 <- rep(chla_vals2, 12)
chla_vals3 <- list(runif(2))
chla_vals3 <- rep(chla_vals3, 12)
#Combining all lists into a larger list, and specifying the list names
chla_comb <- list(chla_vals = chla_vals, chla_vals2 = chla_vals2, chla_vals3 = chla_vals3)
#Storing the names of each list
list_names <- names(chla_comb)
#Creating an empty dataframe to store my values, with 3 columns corresponding to my three original lists
#I know I will need more columns (eg: chlorophyll-a method, month, and lake ID), but I figure I can sort that later
values_df <- setNames(data.frame(matrix(ncol = length(list_names))), list_names)
#The actual loop:
for (i in seq_along(chla_comb)) {
  v <- chla_comb[i]
  values_df[i] <- lapply(v, '[', 1)
}
This kind of works, but it only stores the first two values in each list (i.e. the January chl-a values for lake 1 and lake 2) for each chl-a method (chla_vals, chla_vals2, and chla_vals3). I need all 24 values for each method, as I'm interested in the change in chlorophyll over time.
EDIT: Included a small reproducible example. This is the best I could create but the lists don't look exactly like what I have. I think solutions will work either way.
You can use c to combine the lists into one and then convert it into a 2-column matrix with do.call(rbind, ...).
values_mat <- do.call(rbind, c(chla_vals, chla_vals2, chla_vals3))
values_mat
# [,1] [,2]
# [1,] 0.1264052 0.1575803
# [2,] 0.1264052 0.1575803
# [3,] 0.1264052 0.1575803
# [4,] 0.1264052 0.1575803
# [5,] 0.1264052 0.1575803
# [6,] 0.1264052 0.1575803
# [7,] 0.1264052 0.1575803
#...
#...
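If you also want the method, month and lake ID columns you mention, one possible sketch (assuming, as in your reproducible example, 12 months and 2 lakes per method; the column names are just illustrative):
methods <- c("chla_vals", "chla_vals2", "chla_vals3")
chla_df <- data.frame(
  method = rep(methods, each = 12 * 2),  # one method per original list
  month  = rep(rep(1:12, each = 2), times = length(methods)),
  lake   = rep(1:2, times = 12 * length(methods)),
  chla   = unlist(c(chla_vals, chla_vals2, chla_vals3))
)
head(chla_df)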

Repeated subsetting of the same matrix using apply in R

Motivation: I am currently trying to rethink my coding so as to avoid for-loops where possible. The problem below can easily be solved with conventional for-loops, but I was wondering if R offers a possibility to utilize the apply family to make it easier.
Problem: I have a matrix, say X (an n x k matrix), and two matrices of start and stop indices, called index.starts and index.stops, respectively. They are of size n x B and it holds that index.stops = index.starts + m for some integer m. Each pair index.starts[i,j] and index.stops[i,j] is used to subset X as X[(index.starts[i,j]:index.stops[i,j]), ]. I.e., they should select all the rows of X in their index range.
Can I solve this problem using one of the apply functions?
Application: (Not necessarily important for understanding my problem.) In case you are interested, this is needed for a block bootstrap in a time-series application. X represents the original sample. index.starts is sampled as replicate(repetitionNumber, sample.int((n-r), ceiling(n/r), replace=TRUE)) and index.stops is obtained as index.stops = index.starts + m. What I want in the end is a collection of rows of X. In particular, I want to resample, repetitionNumber times, ceiling(n/r) blocks of length r from X.
Example:
#generate data
n<-100 #the size of your sample
B<-5 #the number of columns for index.starts and index.stops
#and equivalently the number of block bootstraps to sample
k<-2 #the number of variables in X
X<-matrix(rnorm(n*k), nrow=n, ncol = k)
#take a random sample of the indices 1:100 to get index.starts
r<-10 #this is the block length
#get a sample of the indices 1:(n-r), and get ceiling(n/r) of these
#(for n=100 and r=10, ceiling(n/r) = n/r = 10). Replicate this B times
index.starts<-replicate(B, sample.int((n-r), ceiling(n/r), replace=TRUE))
index.stops<-index.starts + r
#Now can I use apply-functions to extract the r subsequent rows that are
#paired in index.starts[i,j] and index.stops[i,j] for i = 1,2,...,10 = ceiling(n/r) and
#j=1,2,3,4,5=B ?
It's probably way more complicated than what you want/need, but here is a first approach. Just comment if it helps you in any way and I am happy to help further.
My approach uses (multiple) *apply functions. The first lapply "loops" over the 1:B cases, where it first draws the start and end points, which are combined into take.rows (the subsetting indices). Next, the initial matrix is subsetted by take.rows (and returned in a list). As a last step, the standard deviation is taken for each column of the subsetted matrices (as a dummy function).
The code (with heavy commenting) looks like this:
# you can use lapply in parallel mode if you want to speed up the code...
lapply(1:B, function(i){
  starts <- sample.int((n-r), ceiling(n/r), replace=TRUE)
  # [1] 64 22 84 26 40  7 66 12 25 15
  ends <- starts + r
  take.rows <- Map(":", starts, ends)  # note: each block spans r+1 rows
  # [[1]]
  # [1] 64 65 66 67 68 69 70 71 72 73 74
  # ...
  res <- lapply(take.rows, function(subs) X[subs, ])
  # res is now a list of 10 with the ten subsets
  # [[1]]
  #            [,1]        [,2]
  # [1,]  0.2658915 -0.18265235
  # [2,]  1.7397478  0.66315385
  # ...
  # say you want to compute something (sd in this case), you can do the following,
  # but better to do the computing directly in the former lapply(take.rows, ...)
  res2 <- t(sapply(res, function(tmp){
    apply(tmp, 2, sd)
  })) # simplify into a matrix
  #           [,1]      [,2]
  # [1,] 1.2345833 1.0927203
  # [2,] 1.1838110 1.0767433
  # [3,] 0.9808146 1.0522117
  # ...
  return(res2)
})
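If you want one matrix across all B replicates rather than a list, you could stack the per-replicate results afterwards (boot_one is a hypothetical name for the anonymous function above, given a name and saved):
boot_res <- lapply(1:B, boot_one)    # the lapply call above, saved
all_sds <- do.call(rbind, boot_res)  # a (B * ceiling(n/r)) x k matrix of sds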
Does that point you in the right direction / give you the answer?

Suppress large output to R console

How can I make R check whether an object is too large to print in the console? "Too large" here means larger than a user-defined value.
Example: You have a list f_data with two elements f_data$data (a 100MB data.frame) and f_data$info (for instance, a vector). Assume you want to inspect the first few entries of the f_data$data data.frame but you make a mistake and type head(f_data) instead of head(f_data$data). R will try to print the whole content of f_data to the console (which would take forever).
Is there somewhere an option that I can set in order to suppress the output of objects that are larger than let's say 1MB?
Edit: Thank you guys for your help. After implementing the max.print option I realized that it does indeed give the desired output, BUT the problem that the output takes very long to show up still persists. I will give a proper example below.
df_nrow=100000
df_ncol=100
#create list with first element being a large data.frame
#second element is a short vector
test_list=list(df=data.frame(matrix(rnorm(df_nrow*df_ncol), nrow=df_nrow, ncol=df_ncol)),
               vec=1:110)
#only print the first 100 elements of an object
options(max.print=100)
#head correctly displays the first row of the data.frame
#BUT for some reason the output takes really long to show up in the console (~30sec)
head(test_list)
#let's try to see how long exactly
system.time(head(test_list))
# user system elapsed
# 0 0 0
#well, obviously system.time is not the proper tool to measure this
#the same problem if I just print the object to the console without using head
test_list$df
I assume that R performs some sort of analysis on the object being printed and this is what takes so long.
Edit 2:
As per my comment below, I checked whether the problem persists if I use a matrix instead of a data.frame.
#create list with first element being a large MATRIX
test_list=list(mat=matrix(rnorm(df_nrow*df_ncol),nrow=df_nrow,ncol=df_ncol),vec=1:110)
#no problem
head(test_list)
#no problem
test_list$mat
Could it be that the output to the console is not really efficiently implemented for data.frame objects?
I think there is no such option, but you can check the size of an object with object.size and print it only if it is below a threshold (measured in bytes), for example:
print.small.objects <- function(x, threshold = 1e06, ...)
{
  if (object.size(x) < threshold) {
    print(x, ...)
  } else {
    cat("too big object\n")
    print(object.size(x))
  }
}
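A usage sketch with a stand-in for the f_data list from the question (built here with random data):
f_data <- list(data = data.frame(x = rnorm(1e6)), info = 1:10)
print.small.objects(f_data$info)  # small enough: printed as usual
print.small.objects(f_data)       # above the threshold: only its size is reported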
Here's an example that you could adjust up to 100MB. It basically only prints the first 6 rows and 5 columns if the object's size is above 8e5 bytes. You could also turn this into a function and place it in your .Rprofile
> lst <- list(data.frame(replicate(100, rnorm(1000))), 1:10)
> sapply(lst, object.size)
# [1] 810968 88
> lapply(lst, function(x){
if(object.size(x) > 8e5) head(x)[1:5] else x
})
#[[1]]
# X1 X2 X3 X4 X5
#1 0.3398235 -1.7290077 -0.35367971 0.09874918 -0.8562069
#2 0.2318548 -0.3415523 -0.38346083 -0.08333569 -1.1091982
#3 0.0714407 -1.4561768 0.50131914 -0.54899188 0.1652095
#4 -0.5170228 1.7343073 -0.05602883 0.87855313 0.4025590
#5 0.6962212 -0.3179930 0.28016057 1.05414456 -0.5172885
#6 0.9471200 1.4424843 -1.46323827 -0.78004192 -1.3611820
#
#[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10
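As mentioned, this can be turned into a function for your .Rprofile; a minimal sketch (the name peek is made up here, not an established function):
peek <- function(x, limit = 8e5) {
  # show only the first 6 rows and 5 columns of anything above the size limit
  if (object.size(x) > limit) head(x)[1:5] else x
}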

Need a more efficient threshold matching with function for R

Not sure how best to ask this question, so feel free to edit the question title if there is a more standard vocabulary to use here.
I have two 2-column data tables in R, the first is a list of unique 2-variable values (u), so much shorter than the second, which is a raw list of similar values (d). I need a function that will, for every 2-variable set of values in u, find all the 2-variable sets of values in d for which both variables are within a given threshold.
Here's a minimal example. Actual data is much larger (see below, as this is the problem) and (obviously) not created randomly as in the example. In the actual data, u would have about 600,000 to 1,000,000 values (rows) and d would have upwards of 10,000,000 rows.
# First create the table of unique variable pairs (no 2-column duplicates)
u <- data.frame(PC1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
                PC2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))
# Now, create the set of raw 2-variable pairs, which may include duplicates
d <- data.frame(PC1=sample(u$PC1,100,replace=T)*sample(90:100,100,replace=T)/100,
                PC2=sample(u$PC2,100,replace=T)*sample(90:100,100,replace=T)/100)
# Set the threshold that defines a 'close-enough' match between u and d values
b <- 0.1
So, my first attempt to do this was with a for loop for all values of u. This works nicely, but is computationally intensive and takes quite a while to process the actual data.
# Make a list to output the list of within-threshold rows
m <- list()
# Loop to find all values of d within a threshold b of each value of u
# The output list will have as many items as values of u
# For each list item, there may be up to several thousand matching rows in d
# Note that there's a timing command (system.time) in here to keep track of performance
system.time({
  for(i in 1:nrow(u)){
    m <- c(m, list(which(abs(d$PC1-u$PC1[i])<b & abs(d$PC2-u$PC2[i])<b)))
  }
})
m
That works. But I thought using a function with apply() would be more efficient. Which it is...
# Make the user-defined function for the threshold matching
# (named match_fn here so it doesn't mask base R's match function)
match_fn <- function(x, ...){
  which(abs(d$PC1-x[1])<b & abs(d$PC2-x[2])<b)
}
# Run the function with the apply() command.
system.time({
  m <- apply(u, 1, match_fn)
})
Again, this apply function works and is slightly faster than the for loop, but only marginally. This may simply be a big data problem for which I need a bit more computing power (or more time!). But I thought others might have thoughts on a sneaky command or function syntax that would dramatically speed this up. Outside the box approaches to finding these matching rows also welcome.
Somewhat sneaky:
library(IRanges)
ur <- with(u*100L, IRanges(PC2, PC1))
dr <- with(d*100L, IRanges(PC2, PC1))
hits <- findOverlaps(ur, dr + b*100L)
Should be fast once the number of rows is sufficiently large. We multiply by 100 to get into integer space. Reversing the order of the arguments to findOverlaps could improve performance.
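A possible follow-up, not part of the original answer: findOverlaps returns a Hits object, so to recover, for each row of u, the matching row indices of d (the same structure as the list m above, except that query rows with no matches are dropped), you could do something like:
# queryHits()/subjectHits() are accessors from the IRanges/S4Vectors packages
m <- split(subjectHits(hits), queryHits(hits))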
Alas, this seems only slightly faster than the for loop
unlist(Map(function(x, y) {
  which(abs(d$PC1-x)<b & abs(d$PC2-y)<b)
}, u$PC1, u$PC2))
but at least it's something.
I have a cunning plan :-). How about just doing the calculations directly:
> set.seed(10)
> bar <- matrix(runif(10), nc=2)
> bar
           [,1]      [,2]
[1,] 0.50747820 0.2254366
[2,] 0.30676851 0.2745305
[3,] 0.42690767 0.2723051
[4,] 0.69310208 0.6158293
[5,] 0.08513597 0.4296715
> foo <- c(.3,.7)
> # foo - bar would recycle foo down the columns; t() makes the
> # subtraction row-wise, which is what we want here
> thresh <- t(foo - t(bar))
> sign(thresh)
     [,1] [,2]
[1,]   -1    1
[2,]   -1    1
[3,]   -1    1
[4,]   -1    1
[5,]    1    1
Now all you have to do is select the rows of that last matrix which are c(-1,1), using which, and you can easily extract the desired rows from your bar matrix. Repeat for each query pair (each row of u).
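Continuing the example, the selection step might look like this (my sketch, not part of the original answer):
s <- sign(thresh)
keep <- s[, 1] == -1 & s[, 2] == 1
which(keep)
# [1] 1 2 3 4
bar[keep, ]  # the corresponding rows of bar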

R: a for statement wanted that allows for the use of values from each row

I'm pretty new to R.
I'm reading in a file that looks like this:
1 2 1
1 4 2
1 6 4
and storing it in a matrix:
matrix <- read.delim("filename",...)
Does anyone know how to write a for statement that adds up the first and last numbers of one row per iteration?
So the output would be:
2
3
5
Many thanks!
Edit: My bad, I should have made this more clear...
I'm actually more interested in an actual for-loop where, in each iteration, I can use multiple values from any column of that specific row. Adding up numbers was just an example. I'm actually planning on doing much more with those values (for more than 2 columns), and there are many rows.
So something in the lines of:
for (i in matrix_i) #where i means each row
{
#do something with column j and column x from row i, for example add them up
}
If you want to get a vector out of this, it is simpler (and marginally computationally faster) to use apply rather than a for statement. In this case,
sums = apply(m, 1, function(x) x[1] + x[3])
Also, you shouldn't call your variable "matrix", since that is the name of a built-in function.
ETA: There is an even easier and computationally faster way. R lets you pull out columns and add them together (since they are vectors, they will get added elementwise):
sums = m[, 1] + m[, 3]
m[, 1] means the first column of the data.
Something along these lines should work rather efficiently (i.e. this is a vectorised approach):
m <- matrix(c(1,1,1,2,4,6,1,2,4), 3, 3)
# [,1] [,2] [,3]
# [1,] 1 2 1
# [2,] 1 4 2
# [3,] 1 6 4
v <- m[,1] + m[,3]
# [1] 2 3 5
You probably can use an apply function or a vectorized approach --- and if you can, you really should, but you asked how to do it in a for loop, so here's how. (Let's call your matrix m.)
results <- numeric(nrow(m))
for (row in seq_len(nrow(m))) {
  results[row] <- m[row, 1] + m[row, 3]
}
This is probably one of those "100 ways to skin a cat" questions. You are perhaps looking for the rowSums function, although you may also find many answers using the apply function.
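For example, with the matrix m from the earlier answer:
rowSums(m[, c(1, 3)])
# [1] 2 3 5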
