I have a dataframe (check the picture). I am creating periods of 30 values and I am calculating how many of this values are over 0.1. At the end, I want to save all the 336 outputs in a dataframe (as a row). How could I do that? My code is failing!
i <- 0
secos=as.data.frame(NULL)
for (i in c(0:336)){
hola=as.data.frame(pp[c(1+i:29 + i)])
secos[[i]]=sum(hola > 0.1)
secos=rbind(secos[[i]])}
Iteratively building (growing) data.frames in R is a bad thing. For good reading, see the R Inferno, chapter 2 on Growing Objects. Bottom line, though: it works, but as you add more rows, it will get progressively slower and use (at least) twice as much memory as you intend.
You explicitly overwrite secos with rbind(secos[[i]]), where the rbind call is a complete no-op doing nothing. (e.g., see identical(rbind(mtcars), mtcars)). Back to (1), best to L <- lapply(0:336, function(i) ...) then secos <- do.call(rbind, L).
R indexes are 1-based, but your first call assigns to secos[[0]] which fails.
A literal translation of this into a better start is something like the following. (Up front, your reference to pp only makes sense if you have an object pp that you used to create your data.frame above ... since pp[.] by itself will not reference the frame. If you're using attach(.) to be able to do that, then ... don't. Too many risks and things that can go wrong with it, it is one of the base functions I'd vote to remove.)
invec <- 0:336
L <- sapply(invec, function(i) {
hola=as.data.frame(pp[c(1+i:29 + i)])
sum(hola > 0.1)
})
secos <- data.frame(i = invec, secos = L)
An alternative:
L <- lapply(invec, function(i) {
hola=as.data.frame(pp[c(1+i:29 + i)])
data.frame(secos = sum(hola > 0.1))
})
out <- do.call(rbind, L)
I can't help but think there is a more efficient, R-idiomatic way to aggregate this data. My guess is that it's a moving window of sorts, perhaps a month wide (or similar). If that's the case, I recommend looking into zoo::rollapply(pp, 30, function(z) sum(z > 0.1)), perhaps with meaningful application of align=, partial=, and/or fill=.
Related
(1,2,2,3,3,3,4,4,4,4,...,n,...,n)
I want to make the above vector by for loop, but not using rep function or the other functions. This may not be good question to ask in stackoverflow, but since I am a newbie for R, I dare ask here to be helped.
(You can suppose the length of the vector is 10)
With a for loop, it can be done with
n <- 10
out <- c()
for(i in seq_len(n)){
for(j in seq_len(i)) {
out <- c(out, i)
}
}
In R, otherwise, this can be done as
rep(seq_len(n), seq_len(n))
I have been beaten by #akrun by seconds, even so I'd like to give you a few hints if using rep would have been possible which may help you with R in general. (Without rep usage, just look at #akrun)
Short answer using rep
rep(1:n, 1:n)
Long Answer using rep
Before posting a question you should try to develop your own solutions and share them.
Trying googling a bit and sharing what you already found is usually good as well. Please, have a look at "help/how-to-ask"
Let's try to do it together.
First of all, we should try to have a look at official sources:
R-project "getting help", here you can see the standard way to get a function's documentation is just typing ?func_name in your R console
R-project "official manuals" offer a good introduction to R. Try looking at the first topic, "An Introduction to R"
From the previous two (and other sources as well) you will find two interesting functions:
: operator: it can be used to generate a sequence of integers from a to b like a:b. Typing 1:3, for instance, gives you the 1, 2, 3 vector
rep(x, t) is a function which can be used to replicate the item(s) x t times.
You also need to know R is "vector-oriented", that is it applies functions over vectors without you typing explicits loops.
For instance, if you call repl(1:3, 2), it's (almost) equivalent to running:
for(i in 1:3)
rep(i, 2)
By combining the previous two functions and the notion R is "vector-oriented", you get the rep(1:n, 1:n) solution.
I am not sure why you don't want to use rep, but here is a method of not using it or any functions similar to rep within the loop.
`for (i in 1:10){
a<-NA
a[1:i] <- i
if (i==1){b<-a}
else if (i >1){b <- c(b,a)}
assign("OutputVector",b,envir = .GlobalEnv)
}`
`OutputVector`
Going for an n of ten seemed subjective so I just did the loop for numbers 1 through 10 and you can take the first 10 numbers in the vector if you want. OutputVector[1:10]
You can do this with a single loop, though it's a while rather than a for
n <- 10
x <- 1;
i <- 2;
while(i <= n)
{
x <- c(x, 1/i);
if(sum(x) %% 1 == 0) i = i + 1;
}
1/x
I am simulating data and filling a matrix using a for loop in R. Currently the loop is running slower than I would like. I've done some work to vectorize some of the variables to improve the loops speed but it still taking some time. I believe the
mat[j,year] <- sum(vec==1)/x
part of the loop is slowing things down. I've looked into filling matrices more efficiently but could not find anything to help my current problem. Eventually this will be used as a part of a shiny app so all of variables I assign will need to be easily assigned different values.
Any advice to speed up the loop or more efficiently write this loop would be greatly appreciated.
Here is the loop:
#These variables are all specified because they need to change with different simulations
num.sims <- 20
time <- 50
mat <- matrix(nrow = num.sims, ncol = time)
x <- 1000
init <- 0.5*x
vec <- vector(length = x)
ratio <- 1
freq <- -0.4
freq.vec <- numeric(nrow(mat))
## start a loop
for (j in 1:num.sims) {
vec[1:init] <- 1; vec[(init+1):x] <- 2
year <- 2
freq.vec[j] <- sum(vec==1)/x
for (i in 1:(x*(time-1))) {
freq.1 <- sum(vec==1)/x; freq.2 <- 1 - freq.1
fit.ratio <- exp(freq*(freq.1-0.5) + log(ratio))
Pr.1 <- fit.ratio*freq.1/(fit.ratio*freq.1 + freq.2)
vec[ceiling(x*runif(1))] <- sample(c(1,2), 1, prob=c(Pr.1,1-Pr.1))
## record data
if (i %% x == 0) {
mat[j,year] <- sum(vec==1)/x
year <- year + 1
}}}
The inner loop is what is slowing you down. You're doing x number of iterations to update each cell in the matrix. Since each trip to modify vec depends on the previous iteration, this would be difficult to simplify. #Andrew Feierman is probably correct that this would benefit from being moved to C++, at least the four lines before the if statement.
Alternatively, this only takes 10-20 seconds to run. Unless you're going to scale this up or run it many times, it might not be worth the trouble to speed it up. If you do keep it as is, you could put a progress bar in Shiny to let the user know things are still working.
Depending on how often you will need to call this loop, it could be worth rewriting it in C++. R is built on C++, and any C++ will run many, many times faster than even efficient R code.
sourceCpp is a good package to start with: https://www.rdocumentation.org/packages/Rcpp/versions/0.12.11/topics/sourceCpp
I need to scale up a set of files for a proof of concept in my company. Essentially have several 1000row files with around 200 columns each, and I want to rbind them until I reach the desired scale. This might be 1Million rows or more.
The output will be essentially a repetition of data (sounds a bit silly) and I'm aware of that, but i just need to prove something.
I used a while loop in R similar to this:
while(nrow(df) < 1000000) {df <- rbind(df,df);}
This seems to work but it looks a bit computationally heavy. It might might take like 10-15minutes.
I though of creating a function (below) and use an "apply" family function on the df, but couldn't succeed:
scaleup_function <- function(x)
{
while(nrow(df) < 1000)
{
x <- rbind(df, df)
}
}
Is there a quicker and more efficient way of doing it (it doesn't need to be with rbind) ?
Many thanks,
Joao
This should do the trick:
df <- matrix(0,nrow=1000,ncol=200)
reps_needed <- ceiling(1000000 / nrow(df))
df_scaled <- df[rep(1:nrow(df),reps_needed),]
I have a df, YearHT, 6.5M x 55 columns. There is specific information I want to extract and add but only based on an aggregate values. I am using a for loop to subset the large df, and then performing the computations.
I have heard that for loops should be avoided, and I wonder if there is a way to avoid a for loop that I have used, as when I run this query it takes ~3hrs.
Here is my code:
srt=NULL
for(i in doubletCounts$Var1){
s=subset(YearHT,YearHT$berthlet==i)
e=unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
srt=rbind(srt,e)
}
srt=data.frame(srt)
s2=data.frame(srt$X2,srt$X1,srt$X3)
colnames(s2)=colnames(srt)
s=rbind(srt,s2)
doubletCounts is 700 x 3 df, and each of the values is found within the large df.
I would be glad to hear any ideas to optimize/speed up this process.
Here is a fast solution using data.table , although it is not completely clear from your question what is the output you want to get.
# load library
library(datat.table)
# convert your dataset into data.table
setDT(YearHT)
# subset YearHT keeping values that are present in doubletCounts$Var1
YearHT_df <- YearHT[ berthlet %in% doubletCounts$Var1]
# aggregate values
output <- YearHT_df[ , .( median= median(berthtime)) ]
for loops aren't necessarily something to avoid, but there are certain ways of using for loops that should be avoided. You've committed the classic for loop blunder here.
srt = NULL
for (i in index)
{
[stuff]
srt = rbind(srt, [stuff])
}
is bound to be slower than you would like because each time you hit srt = rbind(...), you're asking R to do all sorts of things to figure out what kind of object srt needs to be and how much memory to allocate to it. When you know what the length of your output needs to be up front, it's better to do
srt <- vector("list", length = doubletCounts$Var1)
for(i in doubletCounts$Var1){
s=subset(YearHT,YearHT$berthlet==i)
srt[[i]] = unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
}
srt=data.frame(srt)
Or the apply alternative of
srt = lapply(doubletCounts$Var1,
function(i)
{
s=subset(YearHT,YearHT$berthlet==i)
unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
}
)
Both of those should run at about the same speed
(Note: both are untested, for lack of data, so they might be a little buggy)
Something else you can try that might have a smaller effect would be dropping the subset call and use indexing. The content of your for loop could be boiled down to
unlist(c(strsplit(i, '\\|'),
median(YearHT[YearHT$berthlet == i, "berthtime"])))
But I'm not sure how much time that would save.
I am having trouble optimising a piece of R code. The following example code should illustrate my optimisation problem:
Some initialisations and a function definition:
a <- c(10,20,30,40,50,60,70,80)
b <- c(“a”,”b”,”c”,”d”,”z”,”g”,”h”,”r”)
c <- c(1,2,3,4,5,6,7,8)
myframe <- data.frame(a,b,c)
values <- vector(length=columns)
solution <- matrix(nrow=nrow(myframe),ncol=columns+3)
myfunction <- function(frame,columns){
athing = 0
if(columns == 5){
athing = 100
}
else{
athing = 1000
}
value[colums+1] = athing
return(value)}
The problematic for-loop looks like this:
columns = 6
for(i in 1:nrow(myframe){
values <- myfunction(as.matrix(myframe[i,]), columns)
values[columns+2] = i
values[columns+3] = myframe[i,3]
#more columns added with simple operations (i.e. sum)
solution <- rbind(solution,values)
#solution is a large matrix from outside the for-loop
}
The problem seems to be the rbind function. I frequently get error messages regarding the size of solution which seems to be to large after a while (more than 50 MB).
I want to replace this loop and the rbind with a list and lapply and/or foreach. I have started with converting myframeto a list.
myframe_list <- lapply(seq_len(nrow(myframe)), function(i) myframe[i,])
I have not really come further than this, although I tried applying this very good introduction to parallel processing.
How do I have to reconstruct the for-loop without having to change myfunction? Obviously I am open to different solutions...
Edit: This problem seems to be straight from the 2nd circle of hell from the R Inferno. Any suggestions?
The reason that using rbind in a loop like this is bad practice, is that in each iteration you enlarge your solution data frame and then copy it to a new object, which is a very slow process and can also lead to memory problems. One way around this is to create a list, whose ith component will store the output of the ith loop iteration. The final step is to call rbind on that list (just once at the end). This will look something like
my.list <- vector("list", nrow(myframe))
for(i in 1:nrow(myframe)){
# Call all necessary commands to create values
my.list[[i]] <- values
}
solution <- rbind(solution, do.call(rbind, my.list))
A bit to long for comment, so I put it here:
If columns is known in advance:
myfunction <- function(frame){
athing = 0
if(columns == 5){
athing = 100
}
else{
athing = 1000
}
value[colums+1] = athing
return(value)}
apply(myframe, 2, myfunction)
If columns is not given via environment, you can use:
apply(myframe, 2, myfunction, columns) with your original myfunction definition.