Trouble with a loop statement in R - r

I am having trouble writing R code for a looped if/else conditional test. I am trying to solve for x (which must be a whole number) such that F_c = 5 (see below). Both z and w are vectors of known values, with z representing abundance values and w representing area sampled. Right now I am essentially entering trial values of x by hand to see how close I can get to F_c = 5. I would like to write a loop for this and have it stop when an iteration of x results in F_c = 5. Any help would be much appreciated. I have spent a lot of time on this and haven't found a similar question posted yet (if there is one, please direct me to it). Thanks,
cond = ifelse(z<=x, 1, 0)
F_c = 100*(sum(w*z*cond)/sum(w*z))

It's not entirely clear, but I assume you want to know at which point the cumulative sum of w*z reaches five percent of sum(w*z), following the order of z. If that's correct, you can try this:
#for every z get the order indices
indices<-order(z)
#sort both z and w by z
z<-z[indices]
w<-w[indices]
#cumsum gives the cumulative sum of a vector;
#dividing it by sum(w*z) gives the cumulative fraction.
#findInterval gives the index of the last cumulative fraction that is <= .05
res<-findInterval(.05,cumsum(w*z)/sum(w*z))
#the value you are looking for:
z[res]
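If an exact hit on F_c = 5 is not guaranteed, another option is to evaluate F_c at every candidate x and keep the closest one. The following is only a sketch under assumptions: the z and w values are made up for illustration, and f_c, candidates, vals and best are hypothetical names. Because F_c only changes when x passes an observed value of z, it is enough to test the unique values of z.

```r
# Hypothetical example data (not from the question)
z <- c(3, 1, 4, 1, 5, 9, 2, 6)   # abundance values
w <- c(1, 2, 1, 2, 1, 1, 2, 1)   # area sampled

# F_c as defined in the question, written as a function of x
f_c <- function(x) 100 * sum(w * z * (z <= x)) / sum(w * z)

# F_c is a step function of x that only jumps at observed z values,
# so those are the only candidates worth checking
candidates <- sort(unique(z))
vals <- vapply(candidates, f_c, numeric(1))

# Candidate whose F_c is closest to the target of 5
best <- candidates[which.min(abs(vals - 5))]
best
```

No explicit loop or stopping rule is needed, since all candidates are evaluated at once. If z is not integer-valued, the chosen x would still have to be rounded to a whole number.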

Related

Finding the Proportion of a specific difference between the average of two vectors

I have a question for an assignment I'm doing.
Q:
"Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages.
What proportion of these 1,000 averages are more than 1 gram away from the average of x ?"
I understand that, basically, I need to write code that answers: what percentage of "nulls" is more than 1 gram away from the average of "x"? I'm not certain how to write that, since the course hasn't covered how to do it yet. Any help?
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
library(downloader) # assumption: download() comes from the downloader package
download(url, destfile=filename)
x <- unlist( read.csv(filename) )
set.seed(1)
n <- 1000
nulls<-vector("numeric", n)
for(i in 1:n){
control <- sample(x, 5)
nulls[i] <-mean(control)
##I know my last line for this should be something like this
## mean(nulls "+ or - 1")> or < mean(x)
## not certain if they're asking for abs() to be involved.
## is the question asking only for those that are 1 gram MORE than the avg of x?
}
Thanks for any help.
Z
I do think that the absolute distance is what they're after here.
Vectors in R are nice in that you can just perform arithmetic operations between a vector and a scalar and it will apply it element-wise, so computing the absolute value of nulls - mean(x) is easy. The abs function also takes vectors as arguments.
Logical operators (such as < and >) can also be used in the same way, making it equally simple to compare the result with 1. This will yield a vector of booleans (TRUE/FALSE) where TRUE means the value at that index was indeed greater than 1, but booleans are really just numbers (1 or 0), so you can just sum that vector to find the number of TRUE elements.
I don't know what programming level you are on, but I hope this helps without giving the solution away completely (since you said it's for an assignment).
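To illustrate the vectorized pieces on their own, here is a toy sketch with made-up numbers (v, center and far are hypothetical names, and this is deliberately not the assignment data):

```r
v <- c(22.5, 24.1, 25.3, 23.9, 26.2)  # made-up values
center <- 24                          # made-up reference value

# Element-wise: distance of each value from the reference, compared with 1
far <- abs(v - center) > 1
far        # TRUE FALSE TRUE FALSE TRUE

sum(far)   # number of TRUE elements: 3
mean(far)  # proportion of TRUE elements: 0.6
```

Taking mean() of a logical vector gives the proportion of TRUE values directly, which is often the quantity wanted.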

How to get a single number result for a big matrix using var function?

Using the var function,
(a) find the sample variance of your row averages from above;
(b) find the sample variance for your XYZmat as a whole; <-this
(c) Divide the sample variance of the XYZmat by the sample variance of the row averages. The statistical theory says that ratio will on average be close to the row sample size, which is n, here.
(d) Do your results agree with theory? (That is a non-trivial question.) Show your work.
This is what the instructor asked for. I could not get a single-number result for (b), so I used the sd function and squared the result. I keep wondering whether there is a way to get a single number using the var function. In my case n is 30, which I got from the previous part of the homework. This is the first R class I am taking and the first homework assigned, so the answer should be pretty simple.
I tried the as.vector() function and still got a set of numbers as the result. I played around with the var function, with no change.
Unfortunately, I deleted all the code I had since the matrix is so big that my laptop started lagging.
I did not have any error messages, but I kept getting a set of numbers for the answer...
set.seed(123)
XYZmat <- matrix(runif(10000), nrow=100, ncol=100) # make a matrix
varmat <- var(as.vector(XYZmat)) # variance of whole matrix
n <- ncol(XYZmat) # row sample size: number of values averaged in each row
n
#> [1] 100
rowmeans <- rowMeans(XYZmat) # row means
varmat/var(rowmeans) # should be near n
#> [1] 100.6907
Created on 2019-07-17 by the reprex package (v0.3.0)
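For context on the "set of numbers": when var() is given a matrix, it returns the covariance matrix of the columns rather than one variance. A small sketch on a made-up 4 x 5 matrix (m is a hypothetical name) shows the difference, and also that squaring sd() matches var() on the flattened values:

```r
set.seed(123)
m <- matrix(runif(20), nrow = 4, ncol = 5)  # small made-up matrix

dim(var(m))                 # 5 5: covariance matrix of the columns
length(var(as.vector(m)))   # 1: a single variance over all entries

# sd() flattens its input, so squaring it agrees with var(as.vector(m))
all.equal(var(as.vector(m)), sd(m)^2)
```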

Count duration of value in vector in R

I am trying to count the length of consecutive occurrences of a value in a vector such as
q <- c(1,1,1,1,1,1,4,4,4,4,4,4,4,4,4,4,4,4,6,6,6,6,6,6,6,6,6,6,1,1,4,4,4)
Actual vectors are longer than this, and are time based. What I would like would be an output for 4 that tells me it occurred for 12 time steps (before the vector changes to 6) and then 3 time steps. (Not that it occurred 15 times total).
Currently my ideas to do this are pretty inefficient (a loop that looks element by element that I can have stop when it doesn't equal the value I specified). Can anyone recommend a more efficient method?
x <- with(rle(q), data.frame(values, lengths)) will pull the information that you want (courtesy of d.b. in the comments).
From the R Documentation: rle is used to "Compute the lengths and values of runs of equal values in a vector – or the reverse operation."
y <- x[x$values == 4, ] will subset the data frame to include only the value of interest (4). You can then see clearly that 4 ran for 12 times and then later for 3.
Modifying the code will let you check whatever value you want.
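The same idea can be wrapped in a small helper. This is a sketch only: run_lengths_of is a hypothetical name, and q is rebuilt here with rep() so that the run lengths are explicit.

```r
# Hypothetical helper: lengths of each consecutive run of `value` in `v`
run_lengths_of <- function(v, value) {
  r <- rle(v)
  r$lengths[r$values == value]
}

# Runs of 1, 4, 6, 1, 4 with lengths 6, 12, 10, 2, 3
q <- rep(c(1, 4, 6, 1, 4), times = c(6, 12, 10, 2, 3))

run_lengths_of(q, 4)   # 12 3
run_lengths_of(q, 6)   # 10
```

A value that never occurs returns an empty vector, so the result can be checked with length().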

Expected value of the difference between a sum of variables and a threshold

I have a custom deck consisting of eight copies of each card in the sequence 2^n, n = 0, ..., 6. I draw cards (without replacement) until the sum is equal to or greater than a threshold. How can I implement a function in R that calculates the mean of the difference between the sum and the threshold?
I tried to do it using the approach from "How to store values in a vector with nested functions", but it takes ages. I think there is a way to do it with probabilities or simulations, but I can't figure it out.
The threshold could be greater than the value of any single card, e.g. threshold=500, or less than the value of a single card, e.g. threshold=50.
What I have done so far is to find all the subsets whose sum is greater than or equal to the threshold. Then I only need to subtract the threshold and calculate the mean.
I am using the following code in R. For a small set I get the answer quite fast. However, I have been running the function for several hours on the set containing the 56 numbers and it is still working.
set<-c(rep(1,8),rep(2,8), rep(4,8),rep(8,8),rep(16,8),rep(32,8),rep(64,8))
recursive.subset <-function(x, index, current, threshold, result){
for (i in index:length(x)){
if (current + x[i] >= threshold){
store <<- append(store, sum(c(result,x[i])))
} else {
recursive.subset(x, i + 1, current+x[i], threshold, c(result,x[i]))
}
}
}
store <- vector()
inivector <- vector(mode="numeric", length=0) #initializing empty vector
recursive.subset (set, 1, 0, threshold, inivector)
I don't know if it is possible to get an exact solution, simply because there are so many possible combinations. It is probably better to do simulations, i.e. write a script for 1 full draw and then rerun that script many times. Since the solutions are very similar, the simulation should give a pretty good approximation.
Ok, here goes:
set <- rep(2^(0:6), each = 8)
thr <- 500
fun <- function(set,thr){
x <- cumsum(sample(set))
value <- x[min(which(x >= thr))]
value
}
system.time(a <- replicate(100000, fun(set,thr)))
# user system elapsed
# 1.10 0.00 1.09
mean(a - thr)
# [1] 21.22992
Explanation: Rather than drawing cards one at a time, I shuffle the whole deck at once (sample) and then calculate the cumulative sum (cumsum). I then find the first point where the cards add up to the threshold or more, and read the corresponding sum from x. Running this function many times with replicate gives a vector of such sums, and mean(a - thr) gives the mean difference.
Edit: Made a really stupid typo in the code, fixed it now.
Edit2: Shortened the function a little.
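To compare several thresholds, the simulation can be parameterized. The following is a sketch along the same lines, not the code above verbatim: overshoot_once, mean_overshoot and n_sim are hypothetical names, and the guard on thr is an added assumption so that a threshold above the deck total (1016) fails early.

```r
deck <- rep(2^(0:6), each = 8)   # 56 cards, total 1016

# One shuffled deck: overshoot of the running sum when it first reaches thr
overshoot_once <- function(deck, thr) {
  stopifnot(thr <= sum(deck))        # otherwise the threshold is never reached
  total <- cumsum(sample(deck))      # running sum over the shuffled deck
  total[which(total >= thr)[1]] - thr
}

# Average overshoot over n_sim independent shuffles
mean_overshoot <- function(deck, thr, n_sim = 10000) {
  mean(replicate(n_sim, overshoot_once(deck, thr)))
}

set.seed(1)
res50  <- mean_overshoot(deck, 50)
res500 <- mean_overshoot(deck, 500)
c(thr50 = res50, thr500 = res500)
```

Since the largest card is 64 and all sums are integers, each overshoot lies between 0 and 63, which gives a quick sanity check on the output.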

Summing R Matrix ignoring NA's

I have the following claim counts data (triangular) by limits:
claims=matrix(c(2019,690,712,NA,773,574,NA,NA,232),nrow=3, byrow=T)
What would be the most elegant way to do the following simple things resembling Excel's sumif():
put the matrix into as.data.frame() with column names: "100k", "250k", "500k"
sum all numbers except first row; (in this case summing 773,574, and 232). I am looking for a neat reference so I can easily generalize the notation to larger claim triangles.
Sum all numbers, ignoring the NA's. sum(claims, na.rm = T) - Thanks for Gregor's suggestion.
*I played around with the package ChainLadder a bit and enjoyed how it handles triangular data, especially in plotting and calculating link ratios. I wonder more generally if basic R suffices in doing some quick and dirty sumif() or pairwise link ratio kind of calculations? This would be a bonus for me if anyone out there could dispense some words of wisdom.
Thank you!
claims=matrix(c(2019,690,712,NA,773,574,NA,NA,232),nrow=3, byrow=T)
claims.df = as.data.frame(claims)
names(claims.df) <- c("100k", "250k", "500k")
# This isn't the best idea because standard column names don't start with numbers
# If you go non-standard, you'll always have to quote them, that is
# claims.df$100k   # doesn't work (syntax error)
claims.df$`100k` # works
# sum everything
sum(claims, na.rm = TRUE)
# sum everything except for first row
sum(claims[-1, ], na.rm = TRUE)
It's much easier to give specific advice to specific questions than general advice. As to " I wonder more generally if basic R suffices in doing some quick and dirty sumif() or pairwise link ratio kind of calculations?", at least as to the sumif comment, I'm reminded of fortunes::fortune(286)
...this is kind of like asking "will your Land Rover make it up my driveway?", but I'll assume the question was asked in all seriousness.
sum adds up whatever numbers you give it. Subsetting based on logicals is so simple that there is no need for a separate sumif function. Say you have x = rnorm(100), y = runif(100).
# sum x if x > 0
sum(x[x > 0])
# sum x if y < 0.5
sum(x[y < 0.5])
# sum x if x > 0 and y < 0.5
sum(x[x > 0 & y < 0.5])
# sum every other x
sum(x[c(TRUE, FALSE)])
# sum all but the first 10 and last 10 x
sum(x[-c(1:10, 91:100)])
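The same logical subsetting carries over to the claims matrix from the question. A sketch, where below_first and col_totals are hypothetical names and the column names are attached through dimnames instead of a data frame:

```r
claims <- matrix(c(2019, 690, 712,
                   NA,   773, 574,
                   NA,   NA,  232),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("100k", "250k", "500k")))

# "sumif row > 1": row() gives the row index of every cell
below_first <- sum(claims[row(claims) > 1], na.rm = TRUE)
below_first                           # 1579

# Totals per limit, ignoring NAs
col_totals <- colSums(claims, na.rm = TRUE)
col_totals                            # 2019 1463 1518

# A single limit, selected by name
sum(claims[, "250k"], na.rm = TRUE)   # 1463
```

The conditions row(claims) > 1 and col(claims) > 1 do not depend on the matrix size, so the same lines should apply unchanged to larger triangles.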
I don't know what a pairwise link ratio is, but I'm willing to bet base R can handle it easily.
