Function that returns the numbers of subintervals? - r

I'm trying create a function that the user enters with a vector (file) and the function returns the number of subintervals possible of differents sizes without exceeding the dimensions of the vector. The function has worked well for many values, but for some specific values has exceeded the dimensions and i don't know why. Follows the reproducible example, note that when box_size = 101 (size of subintervals) them the function returns ninbox = 5 (number of subintervals) and not 4 how should be.
file = rnorm(500);N<-length(file)
box_size <-c(92,101,111)
j=1;ninbox2<-0;aux_ninbox<-0;aux_ninbox[1]<-box_size[j];ninbox<-0;sum_box<-0
for(j in 1:length(box_size)){
while(aux_ninbox<=N){
ninbox<-ninbox+1
sum_box[ninbox]<- box_size[j]
aux_ninbox<-sum_box[ninbox]+ aux_ninbox
ninbox2[j]<-ninbox
}
aux_ninbox<-0;aux_ninbox[1]<-box_size[j];ninbox<-0;sum_box<-0
}
ninbox2
For instance, if the size of the subinterval is 101 and the size of the vector is 500, them the function count how many subintervals of size 101 fit in 500. In this case, 101+101+101+101 = 404 (4 subintervals) because the next exceeding the dimension of the vector (500). This function have an error, because is returning 5 for intervals of size 101. But note that for intervals of size 92 and 111 is working perfectly!

Okay, now I get it. I simplified your code a little bit:
file <- rnorm(500)
N <- length(file)
box_size <- c(92,101,111)
ninbox2 <- 0
for (j in 1:length(box_size)){
aux_ninbox <- box_size[j]
ninbox <- 0
sum_box <- 0
while (aux_ninbox<=N){
ninbox <- ninbox+1
sum_box[ninbox] <- box_size[j]
aux_ninbox <- sum_box[ninbox]+ aux_ninbox
ninbox2[j] <- ninbox
}
}
ninbox2
There were a few issues when your variables were declared:
aux_ninbox[1] <- box_size[j]
should not be assigned at the end of your for-loop. You have to put it at the beginning, before the while-loop starts. That caused your error, since the second iteration of you for-loop again used the box_size of 92 for the calculation.
If you just want to get the number of boxes, you could simply use
N %/% box_size
This divides the length of your given vector by the sizes of your boxes ignoring the rests. See "Arithmetic Operators" in R help for more information about div and mod.

Related

How to save values in Vector using R

I am supposed to find the mean and standard deviation at each given sample size (N), using the "FOR LOOP". I started writing the code as below, I am required to save all the means into vector "p". How do I save all the means into one vector?
sample.sizes =c(3,10,50,100,500,1000)
mean.sds = numeric(0)
for ( N in sample.sizes ){
x <- rnorm(3,mean=0,sd=1)
mean.sds[i]
}
mean(x)
Actually you are doing many thing wrong?
If you are using variable N in for loop, you are not using it anywhere
for (N in 'some_vector') actually means N will take that value one by one. So N in sample sizes will first take, 3 then 10 then 50 and so on.
Now where does i come into picture?
You are calculating x for each iteration of N. In fact you are not using N anywhere in the loop?
first x will return 3 values. In the next line you intend to store these three values in just ith value of mean.sds where i is unknown and storing three values into one value, as it is, is not logically possible.
Do you want this?
sample.sizes =c(3,10,50,100,500,1000)
mean.sds = numeric(0)
for ( i in seq_along(sample.sizes )){
x <- rnorm(sample.sizes[i], mean=0, sd=1)
mean.sds[i] <- mean(x)
}
mean.sds
[1] 0.6085489531 -0.1547286299 0.0052106559 -0.0452804986 -0.0374094936 0.0005667246
I replaced N with seq_along(sample.sizes) which will give iterations equal to the number of that vector. Six in this example.
I passed each ith element to first argument of rnorm to generate these many random values.
Stored each random value into single vector. calculated its mean (one value only) and stored in ith value of your empty vector.

Expected value of the difference between a sum of variables and a threshold

I had a custom deck consisting of eight cards of the sequence 2^n, n=0,..,6. I draw cards (without replacement) until the sum is equal or greater than the threshold. How can I implement in R a function that calculates the mean of the difference between the sum and the threshold??
I tried to do it using this How to store values in a vector with nested functions
but it takes ages... I think there is a way to do it with probabilities/simulations but I can figure out.
The threshold could be greater than the value of one single card, ie, threshold=500 or less than the value of a single card, ie, threshold=50
What I have done so far is to find all the subsets that meet the condition of the sum greater or equal to the threshold. Then I will only substract the threshold and calculate the mean.
I am using the following code in R. For a small set I get the answer quite fast. However, I have been running the function for several ours with the set containing the 56 numbers and is still working.
set<-c(rep(1,8),rep(2,8), rep(4,8),rep(8,8),rep(16,8),rep(32,8),rep(64,8))
recursive.subset <-function(x, index, current, threshold, result){
for (i in index:length(x)){
if (current + x[i] >= threshold){
store <<- append(store, sum(c(result,x[i])))
} else {
recursive.subset(x, i + 1, current+x[i], threshold, c(result,x[i]))
}
}
}
store <- vector()
inivector <- vector(mode="numeric", length=0) #initializing empty vector
recursive.subset (set, 1, 0, threshold, inivector)
I don't know if it is possible to get an exact solution, simply because there are so many possible combinations. It is probably better to do simulations, i.e. write a script for 1 full draw and then rerun that script many times. Since the solutions are very similar, the simulation should give a pretty good approximation.
Ok, here goes:
set <- rep(2^(0:6), each = 8)
thr <- 500
fun <- function(set,thr){
x <- cumsum(sample(set))
value <- x[min(which(x >= thr))]
value
}
system.time(a <- replicate(100000, fun(set,thr)))
# user system elapsed
# 1.10 0.00 1.09
mean(a - thr)
# [1] 21.22992
Explanation: Rather than drawing a card one at a time, I draw all cards simultaneously (sample) and then calculate the cumulative sum (cumsum). I then find the point where the cards at up to the threshold or larger, and find the sum of those cards back in x. We run this function many times with replicate, to obtain a vector of the values. We use mean(a-thr) to calculate the mean difference.
Edit: Made a really stupid typo in the code, fixed it now.
Edit2: Shortened the function a little.

simulate x percentage of missing and error in data in r

I would like to perform two things to my fairly large data set about 10 K x 50 K . The following is smaller set of 200 x 10000.
First I want to generate 5% missing values, which perhaps simple and can be done with simple trick:
# dummy data
set.seed(123)
# matrix of X variable
xmat <- matrix(sample(0:4, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste ("M", 1:10000, sep ="")
rownames(xmat) <- paste("sample", 1:200, sep = "")
Generate missing values at 5% random places in the data.
N <- 2000000*0.05 # 5% random missing values
inds_miss <- round ( runif(N, 1, length(xmat)) )
xmat[inds_miss] <- NA
Now I would like to generate error (means that different value than what I have in above matrix. The above matrix have values of 0 to 4. So what I would like to do:
(1) I would like to replace x value with another value that is not x (for example 0 can be replaced by a random sample of that is not 0 (i.e. 1 or 2 or 3 or 4), similarly 1 can be replaced by that is not 1 (i.e. 0 or 2 or 3 or 4). Indicies where random value can be replaced can be simply done with:
inds_err <- round ( runif(N, 1, length(xmat)) )
If I randomly sample 0:4 values and replace with the indices, this will sometime replace same value with same value ( 0 with 0, 1 with 1 and so on) without creating error.
errorg <- sample(0:4, length(inds_err), replace = TRUE)
xmat[inds_err] <- errorg
(2) So what I would like to do is introduce error in xmat with missing values, However I do not want NA generated in above step be replaced with a value (0 to 4). So ind_err should not be member of vector inds_miss.
So summary rules :
(1) The missing values should not be replaced with error values
(2) The existing value must be replaced with different value (which is definition of error here)- in random sampling this 1/5 probability of doing this.
How can it be done ? I need faster solution that can be used in my large dataset.
You can try this:
inds_err <- setdiff(round ( runif(2*N, 1, length(xmat)) ),inds_miss)[1:N]
xmat[inds_err]<-(xmat[inds_err]+sample(4,N,replace=TRUE))%%5
With the first line you generate 2*N possible error indices, than you subtract the ones belonging to inds_miss and then take the first N. With the second line you add to the values you want to change a random number between 1 and 4 and than take the mod 5. In this way you are sure that the new value will be different from the original and stil in the 0-4 range.
Here's an if/else solution that could work for you. It is a for loop so not sure if that will be okay for you. Possibly vectorize it is some way to make it faster.
# vector of options
vec <- 0:4
# simple logic based solution if just don't want NA changed
for(i in 1:length(inds_err){
if(is.na(xmat[i])){
next
}else{
xmat[i] <- sample(vec[-xmat[i]], 1)
}
}

how to select a matrix column based on column name

I have a table with shortest paths obtained with:
g<-barabasi.game(200)
geodesic.distr <- table(shortest.paths(g))
geodesic.distr
# 0 1 2 3 4 5 6 7
# 117 298 3002 2478 3342 3624 800 28
I then build a matrix with 100 rows and same number of columns as length(geodesic.distr):
geo<-matrix(0, nrow=100, ncol=length(unlist(labels(geodesic.distr))))
colnames(geo) <- unlist(labels(geodesic.distr))
Now I run 100 experiments where I create preferential attachment-based networks with
for(i in seq(1:100)){
bar <- barabasi.game(vcount(g))
geodesic.distr <- table(shortest.paths(bar))
distance <- unlist(labels(geodesic.distr))
for(ii in distance){
geo[i,ii]<-WHAT HERE?
}
}
and for each experiment, I'd like to store in the matrix how many paths I have found.
My question is: how to select the right column based on the column name? In my case, some names produced by the simulated network may not be present in the original one, so I need not only to find the right column by its name, but also the closest one (suppose my max value is 7, I may end up with a path of length 9 which is not present in the geo matrix, so I want to add it to the column named 7)
There is actually a problem with your approach. The length of the geodesic.distr table is stochastic, and you are allocating a matrix to store 100 realizations based on a single run. What if one of the 100 runs will give you a longer geodesic.distr vector? I assume you want to make the allocated matrix bigger in this case. Or, even better, you want run the 100 realizations first, and allocate the matrix after you know its size.
Another potential problem is that if you do table(shortest.paths(bar)), then you are (by default) considering undirected distances, will end up with a symmetric matrix and count all distances (expect for self-distances) twice. This may or may not be what you want.
Anyway, here is a simple way, with the matrix allocated after the 100 runs:
dists <- lapply(1:100, function(x) {
bar <- barabasi.game(vcount(g))
table(shortest.paths(bar))
})
maxlen <- max(sapply(dists, length))
geo <- t(sapply(dists, function(d) c(d, rep(0, maxlen-length(d)))))

R looping over two vectors

I have created two vectors in R, using statistical distributions to build the vectors.
The first is a vector of locations on a string of length 1000. That vector has around 10 values and is called mu.
The second vector is a list of numbers, each one representing the number of features at each location mentioned above. This vector is called N.
What I need to do is generate a random distribution for all features (N) at each location (mu)
After some fiddling around, I found that this code works correctly:
for (i in 1:length(mu)){
a <- rnorm(N[i],mu[i],20)
feature.location <- c(feature.location,a)
}
This produces the right output - a list of numbers of length sum(N), and each number is a location figure which correlates with the data in mu.
I found that this only worked when I used concatenate to get the values into a vector.
My question is; why does this code work? How does R know to loop sum(N) times but for each position in mu? What role does concatenate play here?
Thanks in advance.
To try and answer your question directly, c(...) is not "concatenate", it's "combine". That is, it combines it's argument list into a vector. So c(1,2,3) is a vector with 3 elements.
Also, rnorm(n,mu,sigma) is a function that returns a vector of n random numbers sampled from the normal distribution. So at each iteration, i,
a <- rnorm(N[i],mu[i],20)
creates a vector a containing N[i] random numbers sampled from Normal(mu[i],20). Then
feature.location <- c(feature.location,a)
adds the elements of that vector to the vector from the previous iteration. So at the end, you have a vector with sum(N[i]) elements.
I guess you're sampling from a series of locations, each a variable no. of times.
I'm guessing your data looks something like this:
set.seed(1) # make reproducible
N <- ceiling(10*runif(10))
mu <- sample(seq(1000), 10)
> N;mu
[1] 3 4 6 10 3 9 10 7 7 1
[1] 206 177 686 383 767 496 714 985 377 771
Now you want to take a sample from rnorm of length N(i), with mean mu(i) and sd=20 and store all the results in a vector.
The method you're using (growing the vector) is not recommended as it will be re-copied in memory each time an element is added. (See Circle 2, although for small examples like this, it's not so important.)
First, initialize the storage vector:
f.l <- NULL
for (i in 1:length(mu)){
a <- rnorm(n=N[i], mean=mu[i], sd=20)
f.l <- c(f.l, a)
}
Then, each time, a stores your sample of length N[i] and c() combines it with the existing f.l by adding it to the end.
A more efficient approach is
unlist(mapply(rnorm, N, mu, MoreArgs=list(sd=20)))
Which vectorizes the loop. Unlist is used as mapply returns a list of vectors of varying lengths.

Resources