I want to generate uniformly distributed random numbers twice, and I used profvis to profile the code.
I found that the second runif call takes much more time than the first one. Is there any way to avoid this?
L is just an integer between 50 and 100. Please ignore the second line.
Additionally, in each iteration of my loop I rbind a new record onto the current records data.frame. This rbind operation is also time-consuming.
If I knew the number of records in advance, I could initialize a data.frame to hold all of them, but the count is not known until the loop ends. Is there a faster way to append a row to an existing data.frame?
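Here is a stripped-down version of the pattern I mean (the record contents are made up):
records <- data.frame(id = integer(0), value = numeric(0))
for (i in 1:1000) {
  new_record <- data.frame(id = i, value = runif(1))
  records <- rbind(records, new_record)  # this rbind gets slower as records grows
}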
As for the runif issue, you can try this simple example to see how the second call turns out:
library(profvis)
profvis({
  runif(100000, 0, 1)
  runif(100000, 0, 1)
})
I have a list of data frames (subspec2) that I want to loop through, pulling from each data frame the column containing its maximum value and writing those columns to a new data frame. I wrote the following loop:
good.data <- data.frame(matrix(nrow = 401, ncol = 78)) # create empty dataframe
for (i in length(subspec2)) ## subspec2 is the list of dataframes
{
  max.name <- names(which.max(apply(subspec2[[i]], MARGIN = 2, max))) # find column name with max value
  good.data[, i] <- subspec2[[i]][max.name] # write the contents of this column into dataframe
}
This seems to work, but it only returns values in the last column; nothing else appears to have been saved. Many threads point out that the data frame must be created outside the loop, but that is not the problem here.
What am I doing wrong?
Thank you!
I believe you need to change for (i in length(subspec2)) to for (i in 1:length(subspec2)). The former performs only one iteration, with i = length(subspec2), whereas the latter iterates over every value of i from 1 to length(subspec2).
(I am fairly sure that is your issue, but it really helps to create a reproducible example so that others can run your code and double-check it. For example, I am not exactly sure what subspec2 looks like, so I cannot run your code as it is; a great resource for this is the reprex package.)
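To make the difference concrete, here is a small made-up list standing in for subspec2:
subspec2 <- list(a = 1:3, b = 4:6, c = 7:9)
for (i in length(subspec2)) print(i)      # one iteration only: prints 3
for (i in 1:length(subspec2)) print(i)    # three iterations: prints 1, 2, 3
# seq_along(subspec2) is a safer equivalent of 1:length(subspec2),
# because it also behaves correctly when the list is empty
for (i in seq_along(subspec2)) print(i)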
Most of the data sets I have worked with have generally been of moderate size (mostly fewer than 100k rows), so my code's execution time has usually not been a big problem for me.
But I was recently trying to write a function that takes two data frames as arguments (with, say, m and n rows) and returns a new data frame with m*n rows. I then have to perform some operations on the resulting data set. So even with small values of m and n (say around 1000 each), the resulting data frame has more than a million rows.
When I try even simple operations on this data set, the code takes an intolerably long time to run. Specifically, the resulting data frame has two columns with numeric values, and I need to add a new column that compares the values of these columns and categorizes each row as "Greater than", "Less than", or "Tied".
I am using the following code:
library(dplyr)
df %>% mutate(compare = ifelse(var1 == var2, "Tied",
                        ifelse(var1 > var2, "Greater than", "Less than")))
And, as I mentioned before, this takes forever to run. I did some research, and I found that operations on a data.table are apparently significantly faster than on a data.frame, so maybe that's one option I can try.
But I have never used data.tables before. So before I plunge into that, I was quite curious to know if there are any other ways to speed up computation time for large data sets.
What other options do you think I can try?
Thanks!
For large problems like this I like to parallelize. Since operations on individual rows are atomic, meaning that the outcome of an operation on a particular row is independent of every other row, this is an "embarrassingly parallel" situation.
library(doParallel)
library(foreach)
registerDoParallel() # You could specify the number of cores to use here; see the documentation.
df$compare <- foreach(m = df$var1, n = df$var2, .combine = 'c') %dopar% {
  # Borrowing from @nicola in the comments because it's a good solution.
  c('Less Than', 'Tied', 'Greater Than')[sign(m - n) + 2]
}
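If the parallel setup feels like overkill, the same indexing trick also works as a plain vectorized expression (assuming the columns are var1 and var2, as in the question), which is often fast enough even for a few million rows:
df$compare <- c('Less Than', 'Tied', 'Greater Than')[sign(df$var1 - df$var2) + 2]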
I'm learning R programming, and I'm trying to understand the best approach for building a vector when you don't know the final size it will end up being. For example, in my case I need to build the vector inside a for loop, but only some iterations contribute an element, and which ones aren't known beforehand.
METHOD 1
I could run through the loop a first time to determine the final vector length, initialize the vector to that length, then run through the loop a second time to populate it. This would be ideal from a memory-usage standpoint, since the vector would occupy exactly the required amount of memory.
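A minimal sketch of the two-pass idea, using a made-up condition (every third iteration produces a value):
n_out <- 0
for (i in 1:100) {
  if (i %% 3 == 0) n_out <- n_out + 1   # first pass: just count
}
out <- numeric(n_out)                    # allocate exactly the required length
j <- 0
for (i in 1:100) {
  if (i %% 3 == 0) {
    j <- j + 1
    out[j] <- i                          # second pass: populate
  }
}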
METHOD 2
Or, I could use a single for loop and simply append to the vector as needed, but this would be inefficient from a memory-allocation standpoint, since a new block may need to be allocated each time an element is appended to the vector. If you're working with big data, this could be a problem.
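For contrast, the append-as-you-go version of the same made-up loop; each c() call may copy the entire vector, which is what makes it slow for large results:
out <- numeric(0)
for (i in 1:100) {
  if (i %% 3 == 0) out <- c(out, i)  # vector may be reallocated on every append
}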
METHOD 3
In C or Matlab, I usually initialize the vector to the largest possible length it could end up with, populate a subset of the elements in the for loop, and then resize the vector appropriately once the loop completes.
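A sketch of this preallocate-and-truncate pattern translated to R, again with the made-up condition:
out <- numeric(100)      # largest possible length
j <- 0
for (i in 1:100) {
  if (i %% 3 == 0) {
    j <- j + 1
    out[j] <- i
  }
}
out <- out[seq_len(j)]   # shrink to the number of elements actually filled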
Since R is used a lot in data science, I thought this would be a topic others would have encountered and there may be a best practice that was recommended. Any thoughts?
Canonical R code would use lapply or similar to run the function on each element, then combine the results in some way. This avoids the need to grow a vector or know the size ahead of time. This is the functional programming approach to things. For example,
set.seed(5)
x <- runif(10)
some_fun <- function(x) {
  if (x > 0.5) {
    return(x)
  } else {
    return(NULL)
  }
}
unlist(lapply(x, some_fun))
The size of the result vector is not specified, but is determined automatically by combining results.
Keep in mind that this is a trivial example for illustration. This particular operation could be vectorized.
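For reference, the vectorized equivalent of this toy example is simply:
x[x > 0.5]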
I think Method 1 is the best approach if you have a very large amount of data, but in general you might want to read this chapter before making a final decision:
http://adv-r.had.co.nz/memory.html
I am totally convinced that an efficient R program should avoid loops whenever possible and instead use the big family of apply functions.
But this cannot happen without pain.
For example, I am facing a problem whose solution involves a sum inside the applied function; as a result, the list of results is reduced to a single value, which is not what I want.
To be concrete, I will try to simplify my problem.
Assume N = 100.
sapply(list(1:N), function(n) (
  choose(n, (floor(n/2)+1):n) *
  eps^((floor(n/2)+1):n) *
  (1 - eps)^(n - ((floor(n/2)+1):n))))
As you can see, the function inside causes the length of the resulting vector to explode,
whereas putting a sum inside would collapse everything to a single value:
sapply(list(1:N), function(n) sum(
  choose(n, (floor(n/2)+1):n) *
  eps^((floor(n/2)+1):n) *
  (1 - eps)^(n - ((floor(n/2)+1):n))))
What I would like to have is a list (or vector) of length N, with one value for each n.
So what do you think? How can I fix it?
Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:
For loops are not inherently slow. For loops are incredibly slow when used improperly, because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in place has a tiny cost, but expanding the length of the vector is fairly costly, because what you're actually doing is creating an entirely new object, finding space for that object, copying the contents over, re-binding the name, removing the old object, and so on. For non-primitive objects (say, data frames), it's even more costly, because every modification, even one that doesn't alter the length of the data.frame, triggers this process.
But: there are ways to optimise a for loop and make them run quickly. The easiest guidelines are:
Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference.
If you are using a vector and can know the length of the output in advance, it will work a lot faster. Specify the size of the output object before writing to it.
Do not twist yourself into knots avoiding for loops.
So in this case, if you're only producing a single value for each element of N, you can make that work perfectly nicely with a vector:
# Create the output object. We specify the length in advance so that
# writing to it is cheap.
output <- numeric(length = length(N))
# Start the for loop
for (i in seq_along(output)) {
  output[i] <- your_computations_go_here(N[i])
}
This isn't actually particularly slow - because you're writing to a vector and you've specified the length in advance. And since data.frames are actually lists of equally-sized vectors, you can even work around some issues with running for loops over data.frames using this; if you're only writing to a single column in the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped through the data.frame, but it'll work faster because you'll only have had to modify it once.
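A minimal sketch of that last point, with made-up data and a made-up per-row computation:
df <- data.frame(x = 1:5)
output <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  output[i] <- df$x[i] * 2    # stand-in for the real per-row computation
}
df$new_col <- output          # modify the data.frame once, at the end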
When doing sequencing, I normally apply TraMineR's seqdef function on a dataset to generate a single sequence object:
sequence_object <- seqdef(data)
However, let's say I want to loop through a data frame and generate one sequence object for every chunk of 10 columns. Then I would do something like this:
colpicks <- seq(10,1000,by=10)
mapply(function(start,stop) seqdef(df[,start:stop]), colpicks-9, colpicks)
Now, I want to store these objects in some suitable manner. Two questions:
What is the most suitable way of storing (or maybe just automatically naming) 100 objects, so that I can easily loop through each of them at a later point?
How can I modify my code above so that it stores the data per your answer to (1)?
"Most suitable" is completely subjective and dependent on your goal.
I'm assuming this question is related to your previous question, and thus I would suggest setting the SIMPLIFY argument of mapply to FALSE:
myMatrixList <- mapply(...., SIMPLIFY = FALSE)
However, even that is not necessary, as you can just combine the sapply approach from the previous question and skip the middle step.
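As a rough sketch of what that could look like (assuming df has 1000 columns, as in the colpicks example, and that TraMineR is loaded for seqdef):
library(TraMineR)
colpicks <- seq(10, 1000, by = 10)
# Keep one sequence object per 10-column chunk in a list instead of simplifying
seq_list <- mapply(function(start, stop) seqdef(df[, start:stop]),
                   colpicks - 9, colpicks, SIMPLIFY = FALSE)
# Name each element so individual objects are easy to retrieve later
names(seq_list) <- paste0("cols", colpicks - 9, "_", colpicks)
# Later, loop over the stored objects by name
for (nm in names(seq_list)) {
  obj <- seq_list[[nm]]
  # ... work with obj ...
}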