I need to split a dataframe into N parts, each of which should have a given length ("proglen").
I created a for-loop which gives me the desired result as a list of Dataframes.
Now I want to change the for-loop to a vector based code.
I'm not sure how to convert a loop into vector based code in R.
obslen = 600 ;proglen = 50 ; N = 50 ; testdat <- list()
for (i in 1:N){
test <- df[df$d >= df$d[obslen + i * proglen] &
df$d < df$d[proglen + obslen +i * proglen],]
testdat[[i]] <- test
}
The variable $d has the type POSIXct.
The result should contain N Dataframes.
These dfs start at the day [obslen + i * proglen] and have the length "proglen".
This is almost the same as what you have done, just written with lapply, which is a little more readable. It gives you a list of data frames. It could also be done with a recursive function, but that would be harder to understand (in my view).
lapply(1:N, function(idx){
this_idx <- obslen + idx * proglen
next_idx <- obslen + (idx + 1) * proglen
df[this_idx:(next_idx - 1), ]  # proglen rows; the end index is exclusive, as in the loop
})
If the dates are distinct then you can always use the split function:
testdat <- split(df[obslen + proglen + seq(0, N * proglen - 1), ], rep(seq(N), each = proglen))
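To check that an index-based split matches the date-based loop from the question, here is a minimal sketch on made-up data. It assumes df is sorted by d and that the dates are distinct; the toy data frame and the small values of obslen, proglen and N are only for illustration.

```r
# Toy data (hypothetical): 40 consecutive days, sorted, no duplicates
df <- data.frame(d = as.POSIXct("2020-01-01", tz = "UTC") + 86400 * (0:39),
                 y = seq_len(40))
obslen <- 10; proglen <- 5; N <- 4

# Reference: the date-based loop from the question
by_loop <- vector("list", N)
for (i in seq_len(N)) {
  start <- df$d[obslen + i * proglen]
  end   <- df$d[obslen + (i + 1) * proglen]
  by_loop[[i]] <- df[df$d >= start & df$d < end, ]
}

# Index-based version: window i covers rows
# obslen + i * proglen up to obslen + (i + 1) * proglen - 1
rows <- obslen + proglen + seq_len(N * proglen) - 1
by_split <- split(df[rows, ], rep(seq_len(N), each = proglen))

# Compare the payload column window by window
same <- all(mapply(function(a, b) identical(a$y, b$y), by_loop, by_split))
```

If the dates contain duplicates or the rows are not sorted, the two approaches can differ, and the date-based filter is the safer choice.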
I hope to get help on the following problem in R.
I have the following code to generate a 30-column dataset based on an exponential distribution:
x0=0
xmax=8000
xout=3000
lambda=0.0002
n=1
x1=x0+rexp(n,lambda)-xout
x2=x1+rexp(n,lambda)-xout
x3=x2+rexp(n,lambda)-xout
x4=x3+rexp(n,lambda)-xout
x5=x4+rexp(n,lambda)-xout
x6=x5+rexp(n,lambda)-xout
x7=x6+rexp(n,lambda)-xout
x8=x7+rexp(n,lambda)-xout
x9=x8+rexp(n,lambda)-xout
x10=x9+rexp(n,lambda)-xout
x11=x10+rexp(n,lambda)-xout
x12=x11+rexp(n,lambda)-xout
x13=x12+rexp(n,lambda)-xout
x14=x13+rexp(n,lambda)-xout
x15=x14+rexp(n,lambda)-xout
x16=x15+rexp(n,lambda)-xout
x17=x16+rexp(n,lambda)-xout
x18=x17+rexp(n,lambda)-xout
x19=x18+rexp(n,lambda)-xout
x20=x19+rexp(n,lambda)-xout
x21=x20+rexp(n,lambda)-xout
x22=x21+rexp(n,lambda)-xout
x23=x22+rexp(n,lambda)-xout
x24=x23+rexp(n,lambda)-xout
x25=x24+rexp(n,lambda)-xout
x26=x25+rexp(n,lambda)-xout
x27=x26+rexp(n,lambda)-xout
x28=x27+rexp(n,lambda)-xout
x29=x28+rexp(n,lambda)-xout
x30=x29+rexp(n,lambda)-xout
I have three doubts:
1 - Is there any way to write this function in a reduced form?
2 - This row (30 columns) needs to be simulated 10,000 times. How to do this in a loop?
3 - The values of each cell (x1, x2, x3 ...) must be limited to the interval x0 and xmax (0-8000). How to do this?
That depends on what you want to do with values over 8000. Here's a solution that just takes those values and wraps them around with a modulo operator.
library(tidyverse)
test <- data.frame(x0 = rep(0, n))
for (i in 1:30) {
new_col <- sym(paste0("x", i))
old_col <- sym(paste0("x", i - 1))
test <- test %>%
mutate(!!new_col := (!!old_col + rexp(n, lambda) - xout) %% xmax)
}
I don't know how familiar you may or may not be with the tidyverse and tidy evaluation, which I've used here liberally. The !! operator, combined with sym(), turns the variable names into actual variables. The %>% operator "pipes" data from one function to the next. The := operator is needed only if you want to make assignments with a !! on the lefthand side.
I think this is my first time actually trying to post an answer on StackOverflow, so be easy on me! :)
As I'm fairly new to R myself, I thought it would be good practice to try to write this out. Perhaps not the most efficient code, but it works:
xmax <- 8000
xout <- 3000
lambda <- 0.0002
n <- 1
iterations <- 30
df <- data.frame(matrix(ncol = 31, nrow = iterations))
names(df) <- c(paste("x", 0:30, sep=""))
for (j in 1:iterations) {
df$x0[j] <- 0
df$x1[j] <- df$x0[j] + rexp(n,lambda)-xout
if (df$x1[j] < 0) {
df$x1[j] <- 0
}
if (df$x1[j] > 8000) {
df$x1[j] <- 8000
}
for (i in 3:31) {
df[j,i] <- df[j, i-1] + rexp(n,lambda)-xout
if (df[j,i] < 0) {
df[j,i] <- 0
}
if (df[j,i] > 8000) {
df[j,i] <- 8000
}
}
}
You can change iterations to 10000; for testing purposes I've used 30. Also, I didn't know whether you wanted to limit the values to 0 and 8000 before or after the next iteration; I've done it before.
Is there any way to write this function in a reduced form?
I would do it like this. Pretty sure this is equivalent.
ncol = 30
row = rexp(ncol, lambda)
row = cumsum(row) - xout * (1:ncol)
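To see why the cumsum form is equivalent to the chain x1, x2, ..., x30 (without any clamping), the following sketch reuses the same random draws for both versions. The names draws, chain and prev are only for illustration.

```r
set.seed(123)  # only so both versions see the same draws
lambda <- 0.0002
xout <- 3000
draws <- rexp(30, lambda)

# Step-by-step chain, as written out in the question
chain <- numeric(30)
prev <- 0  # x0
for (k in 1:30) {
  prev <- prev + draws[k] - xout
  chain[k] <- prev
}

# Vectorised form: x_k = sum of the first k draws minus k * xout
vectorised <- cumsum(draws) - xout * (1:30)

all.equal(chain, vectorised)
```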
This row (30 columns) needs to be simulated 10,000 times. How to do this in a loop?
Use replicate with the code above:
sim_data = t(replicate(10000, {
row = rexp(ncol, lambda)
row = cumsum(row) - xout * (1:ncol)
}))
replicate gives 10000 columns and 30 rows. We use t() to transpose it to 10000 rows with 30 columns.
The values of each cell (x1, x2, x3 ...) must be limited to the interval x0 and xmax (0-8000). How to do this?
Use pmin() and pmax(). Not sure if you want this done before or after the cumulative summing...
sim_data = t(replicate(10000, {
row = rexp(ncol, lambda)
row = cumsum(row) - xout * (1:ncol)
row = pmax(0, row)
row = pmin(xmax, row)
row
}))
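If the limits should instead apply at every step, so that a clamped value feeds into the next column, cumsum no longer fits, because each value depends on the clamped previous one. A minimal sketch using Reduce() with accumulate = TRUE follows; sim_row and n_steps are hypothetical names, and clamping (rather than, say, redrawing) is an assumption about what "limited" means.

```r
set.seed(1)  # only so the sketch is reproducible
x0 <- 0; xmax <- 8000; xout <- 3000; lambda <- 0.0002
n_steps <- 30  # hypothetical name for the number of columns

sim_row <- function() {
  steps <- rexp(n_steps, lambda) - xout
  # accumulate = TRUE keeps every intermediate value; each one is clamped
  # to [x0, xmax] before the next step is added
  path <- Reduce(function(prev, s) min(xmax, max(x0, prev + s)),
                 steps, accumulate = TRUE, init = x0)
  path[-1]  # drop the starting value x0
}

# 1000 rows here to keep the sketch quick; use 10000 for the real run
sim_data <- t(replicate(1000, sim_row()))
```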
I have an empty data frame T_modelled with 2784 columns and 150 rows.
T_modelled <- data.frame(matrix(ncol = 2784, nrow = 150))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
where
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
I filled T_modelled by column with a nested for loop, based on a formula:
for (i in 1:ncol(T_modelled)) {
col_tmp <- colnames(T_modelled)[i]
for (j in 1:nrow(T_modelled)) {
z_tmp <- z[j]-0.1
T_tmp <- MANSRT+As*e^(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i]-((omega/(2*K))^0.5)*z_tmp)
T_modelled[j ,col_tmp] <- T_tmp
}
}
where
MANSRT <- -2.051185
As <- 11.59375
omega <- (2*pi)/(347.875*24*60*60)
c <- 790
k <- 0.00219
pb <- 2600
K <- (k*1000)/(c*pb)
e <- exp(1)
I do get the desired results but I keep thinking there must be a more efficient way of filling that data frame. The loop is quite slow and looks cumbersome to me. I guess there is an opportunity to take advantage of R's vectorized way of calculating. I just cannot see myself how to incorporate the formula in an easier way to fill T_modelled.
Anyone got any ideas how to get the same result in a faster, more "R-like" manner?
I believe this does it.
Run this first instruction right after creating T_modelled, it will be needed to test that the results are equal.
Tm <- T_modelled
Now run your code then run the code below.
z_tmp <- z - 0.1
for (i in 1:ncol(Tm)) {
T_tmp <- MANSRT + As*exp(-z_tmp*(omega/(2*K))^0.5)*sin(omega*t_sec_ERT[i]-((omega/(2*K))^0.5)*z_tmp)
Tm[ , i] <- T_tmp
}
all.equal(T_modelled, Tm)
#[1] TRUE
You don't need the inner loop, that's the only difference.
(I also used exp directly but that is of secondary importance.)
Much like the solution you accepted for your previous question, consider simply using sapply, iterating through the vector t_sec_ERT, which has the same length as the number of columns in your desired data frame. But first adjust every element of z by 0.1. Plus, there's no need to create an empty data frame beforehand.
z_adj <- z - 0.1
T_modelled2 <- data.frame(sapply(t_sec_ERT, function(ert)
MANSRT+As*e^(-z_adj*(omega/(2*K))^0.5)*sin(omega*ert-((omega/(2*K))^0.5)*z_adj)))
colnames(T_modelled2) <- paste0("t=", t_sec_ERT)
rownames(T_modelled2) <- paste0("z=", z)
all.equal(T_modelled, T_modelled2)
# [1] TRUE
Rui is of course correct, I just want to suggest a way of reasoning when writing a loop like this.
You have two numeric vectors. Functions for numerics in R are usually vectorized. By which I mean you can do stuff like this
x <- c(1, 6, 3)
sum(x)
not needing something like this
x_ <- 0
for (i in x) {
x_ <- i + x_
}
x_
That is, there is no need for looping in R. Of course looping takes place nonetheless; it just happens in the underlying C, Fortran etc. code, where it can be done more efficiently. This is usually what we mean when we call a function vectorized: looping takes place "under the hood", as it were. The output of Vectorize() thus isn't strictly vectorized by this definition.
When you have two numeric vectors you want to loop over you have to first see if the constituent functions are vectorized, usually by reading the docs.
If they are, you continue by constructing that central vectorized compound function and start testing it with one vector and one scalar. In your case it would be something like this (testing with just the first element of t_sec_ERT).
z_tmp <- z - 0.1
i <- 1
T_tmp <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
Looks OK. Then you start looping over the elements of t_sec_ERT.
T_tmp <- matrix(nrow=length(z), ncol=length(t_sec_ERT))
for (i in 1:length(t_sec_ERT)) {
T_tmp[, i] <- MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*t_sec_ERT[i] - ((omega/(2*K))^0.5)*z_tmp)
}
Or you can do it with sapply() which is often neater.
f <- function(x) {
MANSRT + As *
exp(-z_tmp*(omega/(2*K))^0.5) *
sin(omega*x - ((omega/(2*K))^0.5)*z_tmp)
}
T_tmp <- sapply(t_sec_ERT, f)
I would prefer to put the data in a long format, with all combinations of z and t_sec_ERT as two columns, in order to take advantage of vectorization. Although I usually prefer tidyr for switching between long and wide formats, I've tried to keep this as a base solution:
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)
v <- expand.grid(t_sec_ERT, z)
names(v) <- c("t_sec_ERT", "z")
v$z_tmp <- v$z-0.1
v$T_tmp <- MANSRT+As*e^(-v$z_tmp*(omega/(2*K))^0.5)*sin(omega*v$t_sec_ERT-((omega/(2*K))^0.5)*v$z_tmp)
T_modelled <- data.frame(matrix(v$T_tmp, nrow = length(z), ncol = length(t_sec_ERT), byrow = TRUE))
names(T_modelled) <- paste0("t=", t_sec_ERT)
rownames(T_modelled) <- paste0("z=", seq(from = 0.1, to = 15, by = 0.1))
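Another base R option, since every function in the formula is vectorized, is outer(), which evaluates the formula over all combinations of depth and time in one call and returns a matrix directly. This is only a sketch; damp, z_adj and T_outer are names introduced here for illustration, and the constants are copied from the question.

```r
MANSRT <- -2.051185
As <- 11.59375
omega <- (2 * pi) / (347.875 * 24 * 60 * 60)
K <- (0.00219 * 1000) / (790 * 2600)  # (k * 1000) / (c * pb) from the question
t_sec_ERT <- seq(from = -23349600, to = 6706800, by = 10800)
z <- seq(from = 0.1, to = 15, by = 0.1)

damp <- sqrt(omega / (2 * K))  # the factor that appears twice in the formula
z_adj <- z - 0.1

# outer() passes two equally long vectors (all z/t combinations) to the function
T_outer <- outer(z_adj, t_sec_ERT, function(zz, tt) {
  MANSRT + As * exp(-zz * damp) * sin(omega * tt - damp * zz)
})
dimnames(T_outer) <- list(paste0("z=", z), paste0("t=", t_sec_ERT))
```

The result is a matrix with one row per depth and one column per time step; wrap it in as.data.frame() if a data frame is required.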
I want to multiply and then sum the unique pairs of a vector, excluding pairs made of the same element, such that for c(1:4):
(1*2) + (1*3) + (1*4) + (2*3) + (2*4) + (3*4) == 35
The following code works for the example above:
x <- c(1:4)
bar <- NULL
for( i in 1:length(x)) { bar <- c( bar, i * c((i+1) : length(x)))}
sum(bar[ 1 : (length(bar) - 2)])
However, my actual data is a vector of rational numbers, not integers, so the (i+1) portion of the loop will not work. Is there a way to look at the next element of the set after i, e.g. j, so that I could write i * c((j : length(x))?
I understand that for loops are usually not the most efficient approach, but I could not think of how to accomplish this via apply etc. Examples of that would be welcome, too. Thanks for your help.
An alternative to a loop would be to use combn and multiply the combinations using the FUN argument. Then sum the result:
sum(combn(x = 1:4, m = 2, FUN = function(x) x[1] * x[2]))
# [1] 35
Even better to use prod in FUN, as suggested by @bgoldst:
sum(combn(x = 1:4, m = 2, FUN = prod))
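For long vectors, combn can become slow because the number of pairs grows quadratically. The same sum also follows from the identity (sum of x)^2 = sum of x^2 + 2 * (sum over pairs), which needs no pairs at all. A small sketch with made-up non-integer values:

```r
x <- c(0.5, 1.25, 2, 3.75)  # made-up non-integer values

# (sum(x))^2 counts every ordered pair plus the squares,
# so remove the squares and halve what is left
pair_sum <- (sum(x)^2 - sum(x^2)) / 2

# Cross-check against the combn approach
all.equal(pair_sum, sum(combn(x, 2, FUN = prod)))
```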
I am trying to split my data set using two parameters, the fraction of missing values and "maf", and store the sub-data sets in a list. Here is what I have done (it's not working). Any help will be appreciated,
Thanks.
library(BLR)
library(missForest)
data(wheat)
X2<- prodNA(X, 0.4) ### creating missing values
dim(X2)
fd<-t(X2)
MAF<-function(geno){ ## markers are in the rows
geno[(geno!=0) & (geno!=1) & (geno!=-1)] <- NA
geno <- as.matrix(geno)
## calc_Freq for alleles
n0 <- apply(geno==0,1,sum,na.rm=T)
n1 <- apply(geno==1,1,sum,na.rm=T)
n2 <- apply(geno==-1,1,sum,na.rm=T)
n <- n0 + n1 + n2
## calculate allele frequencies
p <- ((2*n0)+n1)/(2*n)
q <- 1 - p
maf <- pmin(p, q)
maf}
frac.missing <- apply(fd,1,function(z){length(which(is.na(z)))/length(z)})
maf<-MAF(fd)
lst<-matrix()
for (i in seq(0.2,0.7,by =0.2)){
for (j in seq(0,0.2,by =0.005)){
lst=fd[(maf>j)|(frac.missing < i),]
}}
It sounds like you want the results that the split function provides.
If you have a vector "frac.missing", and "maf" is defined on the basis of values in "fd" (and has the same length as the number of rows in "fd"), then this would provide the split you are looking for:
spl.fd <- split(fd, list(maf, frac.missing) )
If you want to "group" the fd values based on MAF(fd) and frac.missing within the bands specified by your for-loop, then the same split construct may do what your current code is failing to accomplish:
lst <- split( fd, list(cut(maf, breaks = seq(0,0.2,by =0.005) ,
include.lowest=TRUE),
cut(frac.missing, breaks = seq(0.2,0.7,by =0.2),
right=TRUE,include.lowest=TRUE)
)
)
The right argument controls which end of each band is closed: right = TRUE (the default for cut) gives intervals of the form (a, b], while right = FALSE gives [a, b), which is the one to use if you need a strict "<" against the upper break. The other function that provides a similar facility is by.
The code below gives me exactly what I need:
Y<-t(GBS.binary)
nn<-colnames(Y)
fd<-Y
maf<-as.matrix(MAF(Y))
dff<-cbind(frac.missing,maf,Y)
colnames(dff)<-c("fm","maf",nn)
dff<-as.data.frame(dff)
for (i in seq(0.1,0.6,by=0.1)) {
for (j in seq(0,0.2,by=0.005)){
assign(paste("fm_",i,"maf_",j,sep=""),
(subset(dff, maf>j & fm <i))[,-c(1,2)])
} }
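Since the original goal was to store the sub-data sets in a list, the same subsetting can be collected in a named list instead of creating many separate variables with assign(). This is a sketch on made-up data; the toy dff, the marker columns m1 and m2, and the names grid and subsets are assumptions for illustration only.

```r
set.seed(42)
# Toy stand-in for dff: fraction missing, maf, then marker columns
dff <- data.frame(fm = runif(20), maf = runif(20, 0, 0.5),
                  m1 = rnorm(20), m2 = rnorm(20))

# Every combination of the two thresholds used in the nested loop
grid <- expand.grid(fm_cut = seq(0.1, 0.6, by = 0.1),
                    maf_cut = seq(0, 0.2, by = 0.005))

# One subset per threshold pair, dropping the fm and maf columns
subsets <- Map(function(i, j) dff[dff$maf > j & dff$fm < i, -(1:2), drop = FALSE],
               grid$fm_cut, grid$maf_cut)
names(subsets) <- paste0("fm_", grid$fm_cut, "maf_", grid$maf_cut)
```

Individual subsets can then be retrieved by name, for example subsets[["fm_0.3maf_0.1"]], and the whole collection can be passed to lapply.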
I have a working solution to my problem, but I will not be able to use it because it is so slow (my calculations predict that the whole simulation will take 2-3 years!). Thus I am looking for a better (faster) solution. This is (in essence) the code I am working with:
N=4
x <-NULL
for (i in 1:N) { #first loop
v <-sample(0:1, 1000000, 1/2) #generate data
v <-as.data.frame(v) #convert to dataframe
v$t <-rep(1:2, each=250) #group
v$p <-rep(1:2000, each=500) #p.number
# second loop
for (j in 1:2000) { #second loop
#count rle for group 1 for each pnumber
x <- rbind(x, table(rle(v$v[v$t==1&v$p==j])))
#count rle for group 2 for each pnumber
x <- rbind(x, table(rle(v$v[v$t==2&v$p==j])))
} #end second loop
} #end first loop
#total rle counts for both group 1 & 2
y <-aggregate(x, list(as.numeric(rownames(x))), sum)
In words: The code generates a coin-flip simulation (v). A group factor is generated (1 & 2). A p.number factor is generated (1:2000). The run lengths are recorded for each p.number (1:2000) for both groups 1 & group 2 (each p.number has runs in both groups). After N loops (the first loop), the total run lengths are presented as a table (aggregate) (that is, the run lengths for each group, for each p.number, over N loops as a total).
I need the first loop because the data that I am working with comes in individual files (so I'm loading the file, calculating various statistics etc and then loading the next file and doing the same). I am much less attached to the second loop, but can't figure out how to replace it with something faster.
What can be done to the second loop to make it (hopefully, a lot) faster?
You are committing the cardinal sin of growing an object within a for() loop in R. Don't (I repeat don't) do this. Allocate sufficient storage for x at the beginning and then fill in x as you go.
x <- matrix(nrow = N * (2000 * 2), ncol = ??)
Then in the inner loop
x[ii, ] <- table(rle(....))
where ii is a loop counter that you initialise to 1 before the first loop and increment within the second loop:
x <- matrix(nrow = N * (2000 * 2), ncol = ??)
ii <- 1
for(i in 1:N) {
.... # stuff here
for(j in 1:2000) {
.... # stuff here
x[ii, ] <- table(rle(....))
## increment ii
ii <- ii + 1
x[ii, ] <- table(rle(....))
## increment ii
ii <- ii + 1
} ## end inner loop
} ## end outer loop
Also note that you are reusing index i in both for() loops, which will not work. i is just a normal R object, and so both for() loops will be overwriting it as they progress. Use j for the second loop as I did above.
Try that simple optimisation first and see if that will allow the real simulation to complete in an acceptable amount of time. If not, come back with a new Q showing the latest code and we can think about other optimisations. The optimisation above is simple to do, optimising table() and rle() might take a lot more work. Noting that, you might look at the tabulate() function which does the heavy lifting in table(), which might be one avenue for optimising that particular step.
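As a rough illustration of the tabulate() idea, the sketch below counts run lengths for each outcome into fixed-length vectors, which is the shape a preallocated matrix row needs. The cap max_len is a hypothetical choice, not something from the question; tabulate() silently ignores runs longer than nbins, so the cap has to be large enough for the real data.

```r
set.seed(1)  # only so the sketch is reproducible
flips <- sample(0:1, 500, replace = TRUE)
r <- rle(flips)

max_len <- 30  # hypothetical upper bound on the run length

# counts0[k] is the number of runs of 0s of length k; likewise counts1 for 1s
counts0 <- tabulate(r$lengths[r$values == 0], nbins = max_len)
counts1 <- tabulate(r$lengths[r$values == 1], nbins = max_len)
```

Because both vectors always have length max_len, they can be written into, or added onto, a preallocated row without any rbind().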
If you just want to run rle and table for each combination of the values of v$t and v$p separately, there is no need for the second loop. It is much faster in this way:
values <- v$v + v$t * 10 + v$p * 100
runlength <- rle(values)
runlength$values <- runlength$values %% 2
x <- table(runlength)
y <- aggregate(unclass(x), list(as.numeric(rownames(x))), sum)
The whole code will look like this. If N is as low as 4, the growing object x will not be a severe problem. But generally I agree with @GavinSimpson that it is not a good programming technique.
N=4
x <-NULL
for (i in 1:N) { #first loop
v <-sample(0:1, 1000000, 1/2) #generate data
v <-as.data.frame(v) #convert to dataframe
v$t <-rep(1:2, each=250) #group
v$p <-rep(1:2000, each=500) #p.number
values <- v$v + N * 10 + v$t * 100 + v$p * 1000
runlength <- rle(values)
runlength$values <- runlength$values %% 2
x <- rbind(x, table(runlength))
} #end first loop
y <-aggregate(x, list(as.numeric(rownames(x))), sum) #tota
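The key-encoding trick above can be checked against a per-group rle on a small example. The sketch assumes, as in the question's layout, that each (t, p) combination occupies one contiguous block of rows; if a group were spread over several blocks, a per-group rle would join runs across blocks and the two results could differ. The sizes and the names loop_runs and fast_runs are made up for illustration.

```r
set.seed(2)  # only so the sketch is reproducible
v <- data.frame(v = sample(0:1, 200, replace = TRUE))
v$t <- rep(1:2, each = 25)   # recycled down the 200 rows
v$p <- rep(1:4, each = 50)   # each p holds one block of t = 1 and one of t = 2

# Reference: run lengths computed separately for every (t, p) group
loop_runs <- do.call(rbind, lapply(split(v$v, list(v$t, v$p)), function(g) {
  r <- rle(g)
  data.frame(lengths = r$lengths, values = r$values)
}))

# Single pass: the key changes whenever v, t or p changes, so runs never
# cross a group boundary; the key's parity recovers the coin value
key <- v$v + v$t * 10 + v$p * 100
r <- rle(key)
fast_runs <- data.frame(lengths = r$lengths, values = r$values %% 2)

all.equal(table(loop_runs), table(fast_runs))
```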