I'm working with some big data.frames in R and want to know which of the following two options is more efficient time-wise:
df[which(condition), ] = value
or
df[condition, ] = value
Assume that most of the data doesn't fulfill the condition, so that length(which(condition)) is much smaller than the length of the logical vector.
Is it more efficient to supply specific indices than to have R scan the whole data.frame/vector, checking each row/element against the logical vector and keeping it where the value is TRUE? Or does the extra function call to which() just add overhead?
I assumed someone else had already asked this, but I couldn't find an answer. This seems relevant, but the discussions I saw there only cover the case where you need the logical vector/indices again.
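One way to settle this empirically is a quick benchmark. Here is a minimal sketch; the data sizes, column names, and the choice of the microbenchmark package are illustrative, not from the original question:

library(microbenchmark)

df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
condition <- df$x > 3  # TRUE for only a tiny fraction of rows

microbenchmark(
  logical = df[condition, ],          # subset with the logical vector directly
  indices = df[which(condition), ],   # convert to integer indices first
  times = 20
)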
I'm trying to declare an array in R, something logically equivalent to the following Java code:
Object[][] array = new Object[6][32];
After I declare this array, I plan to loop over the indices and assign values to them.
I am not familiar with what you are planning to do in R, but loops are generally not recommended, especially when you don't know the length of the output.
You might want to look for a "vectorized" solution first; if that isn't possible, something in the apply family might also be helpful.
Disclaimer: I am certain there is more nuance to this discussion based on what I have read, so I don't want to claim to be an expert on this subject.
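That said, if you do want a direct analogue of Object[6][32], here is a minimal sketch (dimensions taken from the question, values illustrative). Since R lists can hold arbitrary objects, the closest equivalent is a list-matrix; for purely numeric values a plain matrix is simpler and faster:

# Closest analogue of Object[6][32]: a list-matrix that can hold any object
arr <- matrix(vector("list", 6 * 32), nrow = 6, ncol = 32)
arr[[1, 1]] <- "any object"

# For numeric-only data, preallocate a plain matrix and fill it
num <- matrix(NA_real_, nrow = 6, ncol = 32)
for (i in 1:6) {
  for (j in 1:32) {
    num[i, j] <- i * j  # illustrative assignment
  }
}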
I am trying to find an efficient way to create a new array by repeating each element of an old array a different, specified number of times. I have come up with something that works, using array comprehensions, but it is not very efficient, either in memory or in computation:
LENGTH = 1e6
A = collect(1:LENGTH) ## arbitrary values that will be repeated specified numbers of times
NumRepeats = [rand(20:100) for idx = 1:LENGTH] ## arbitrary numbers of times to repeat each value in A
B = vcat([ [A[idx] for n = 1:NumRepeats[idx]] for idx = 1:length(A) ]...)
Ideally, what I would like would be a structure akin to the sparse matrix apparatus that Julia has but that would instead store data efficiently based on the indices where repeated values occur. Barring that, I would at least like an efficient way to create a vector such as B in the example above. I looked into the repeat() function, but as far as I can tell from the documentation and my experimentation with the function, it is just for repeating slices of an array the same number of times for each slice. What is the best way to approach this?
Sounds like you're looking for run-length encoding. There's an RLEVectors.jl package here: https://github.com/phaverty/RLEVectors.jl. Not sure how usable it is. You could also make your own data type fairly easily.
Thanks for trying RLEVectors.jl. Some features and optimizations had been languishing on master without a version bump. It can definitely be mixed with other vectors for element-wise arithmetic. I'll put the linear algebra operations on the feature request list. Any additional feature suggestions would be most welcome.
RLEVectors.jl has a rep function that works like R's and RLEVectors.inverse_ree is like StatsBase.inverse_rle, but it works on run ends rather than lengths.
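For reference, this is the R rep() behavior being mimicked (a small R illustration, not Julia):

A <- 1:5
NumRepeats <- c(2, 1, 3, 2, 1)
rep(A, times = NumRepeats)
# [1] 1 1 2 3 3 3 4 4 5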
I'm optimizing a more complex piece of code, but got stuck on this problem.
a <- array(sample(1:10, 100, replace = TRUE), c(10, 10))
m <- array(sample(1:10, 100, replace = TRUE), c(10, 10))
f <- array(sample(1:10, 100, replace = TRUE), c(10, 10))
g <- array(NA, c(10, 10))
I need to use the values in a and m to index into f and assign the resulting value to g, i.e.
g[1,1] <- f[a[1,1], m[1,1]]
but for all indices at once, and as optimally/fast as possible.
I could obviously write a for loop to do this, but that seems rather dumb and slow. It seems like I should be able to use something in the apply family, but I've had no luck figuring out how. I do need to keep the data structured as it is here so that I can use matrix operations in other parts of my code. I've been searching for an answer but haven't found anything particularly helpful yet.
g[] <- f[cbind(c(a), c(m))]
This takes advantage of the fact that a matrix can be assigned to as a vector (g[] <- ...) and indexed by a two-column matrix of (row, column) pairs.
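A quick sketch checking that the one-liner matches the explicit double loop, using the arrays defined in the question:

g_loop <- array(NA, c(10, 10))
for (i in 1:10) {
  for (j in 1:10) {
    g_loop[i, j] <- f[a[i, j], m[i, j]]
  }
}

g <- array(NA, c(10, 10))
g[] <- f[cbind(c(a), c(m))]

identical(g, g_loop)  # TRUE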
I'm learning R programming and trying to understand the best way to build a vector when you don't know its final size in advance. In my case, I need to build the vector inside a for loop, but only on some iterations, which aren't known beforehand.
METHOD 1
I could run through the loop once to determine the final vector length, initialize the vector to that length, then run through the loop a second time to populate it. This would be ideal from a memory standpoint, since the vector would occupy exactly the memory required.
METHOD 2
Or, I could use a single for loop and simply append to the vector as needed. But this is inefficient from a memory-allocation standpoint, since a new block may need to be allocated each time an element is appended. If you're working with big data, this could be a problem.
METHOD 3
In C or MATLAB, I usually initialize the vector to the largest length it could possibly need, populate a subset of its elements in the for loop, then resize it to its final length when the loop completes.
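In R, Method 3 would look something like this (a minimal sketch; the size and the keep/discard condition are illustrative):

n_max <- 100                 # largest possible number of elements
out <- numeric(n_max)        # preallocate to the maximum size
k <- 0                       # count of elements actually kept
for (i in 1:n_max) {
  val <- runif(1)
  if (val > 0.5) {           # only some iterations produce a value
    k <- k + 1
    out[k] <- val
  }
}
out <- out[seq_len(k)]       # shrink to the final length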
Since R is used a lot in data science, I thought others would have encountered this and that there might be a recommended best practice. Any thoughts?
Canonical R code would use lapply or similar to run the function on each element, then combine the results in some way. This avoids the need to grow a vector or know the size ahead of time. This is the functional programming approach to things. For example,
set.seed(5)
x <- runif(10)

# Return the value if it passes the test, NULL otherwise;
# unlist() drops the NULLs when combining the results.
some_fun <- function(x) {
  if (x > 0.5) {
    return(x)
  } else {
    return(NULL)
  }
}

unlist(lapply(x, some_fun))
The size of the result vector is not specified, but is determined automatically by combining results.
Keep in mind that this is a trivial example for illustration. This particular operation could be vectorized.
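For this particular example, the vectorized equivalent is just logical subsetting:

x[x > 0.5]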
I think Method 1 is the best approach if you have a very large amount of data, but in general you might want to read this chapter before making a final decision:
http://adv-r.had.co.nz/memory.html
How does one work with sums in R? I can't seem to find an easy way to calculate sums \sum_{i=m}^n a_i. There are three things to decide here: where the summation starts, where it ends, and which elements are summed.
I have a data frame df and I would like to calculate \sum_{i=1}^{n-3} df$col[i] * df$col[i+3], where col is a column of length 1000 in df, i.e. n = 1000. How can I do this? I've found one very cumbersome way of doing it, namely
new <- numeric(997)
for (n in 1:997) {
  new[n] <- df$col[n] * df$col[n + 3]
}
sum(new)
That's a stupid way of doing it, so how do I do it in a more "natural" way? Yes, I'm sure this precise question has been asked before, but I didn't know how to narrow down my search. "R + sum + why don't programmers think like mathematicians", maybe ;) Anyway, hints or links to tutorials for R beginners would be much appreciated, thanks.
You can do this with:
sum(df$col[1:997] * df$col[4:1000])
This will be a good deal quicker than looping through the indices and individually multiplying.
Rather, you should use vectorization and a couple of tricks to avoid explicit indexes:
with(df, sum(head(col,-3)*tail(col,-3)))
Or use dplyr's lead function:
sum(df$col * lead(df$col, 3, default = 0))
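A quick sanity check on a small vector (illustrative values) shows all three approaches agree:

library(dplyr)  # for lead()

col <- 1:6
sum(col[1:3] * col[4:6])              # 1*4 + 2*5 + 3*6 = 32
sum(head(col, -3) * tail(col, -3))    # 32
sum(col * lead(col, 3, default = 0))  # 32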