I have a list, let's say of 4 items, and I need to cut that list down to 2 items by taking any two elements of it and finding their average.
This is the algorithm I came up with, but I do not know how to write it in R:
1. choose an x_i
2. choose an x_j not equal to x_i
3. find the average of x_i and x_j
4. choose a new x_(i+1) and x_(j+1), as long as they are not equal to x_i or x_j
for example:
x <- c(2,4,6,8)
y <- c((2+4)/2,(6+8)/2) or c((2+6)/2,(2+8)/2) or anything similar to that.
For the sake of closing this question as answered, we can use the following syntax to do what we need to do: replicate(2, mean(sample(x, 2)))
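Note that replicate(2, mean(sample(x, 2))) draws each pair independently, so the two pairs can share an element. If the pairs must be disjoint (as step 4 above suggests), one sketch is to shuffle the vector once and average consecutive pairs:
x <- c(2, 4, 6, 8)
s <- sample(x)                       # a random permutation, so no element is reused
y <- colMeans(matrix(s, nrow = 2))   # columns are the disjoint pairs s[1:2] and s[3:4]
y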
I have a question for an assignment I'm doing.
Q:
"Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages.
What proportion of these 1,000 averages are more than 1 gram away from the average of x?"
I understand that, basically, I need to write code that answers: what proportion of the "nulls" is more than 1 gram away from the average of "x"? I'm not really certain how to write that, given that this course hasn't covered how to do that yet but is asking us to do so. Any help on how to do so?
library(downloader)   # provides download(); install.packages("downloader") if needed
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile = filename)
x <- unlist(read.csv(filename))
set.seed(1)
n <- 1000
nulls <- vector("numeric", n)
for (i in 1:n) {
  control <- sample(x, 5)
  nulls[i] <- mean(control)
}
## I know my last line for this should be something like
## mean(nulls "+ or - 1") > or < mean(x)
## Not certain if they're asking for abs() to be involved,
## or if the question asks only for those that are 1 gram MORE than the avg of x.
Thanks for any help.
Z
I do think that the absolute distance is what they're after here.
Vectors in R are nice in that you can perform arithmetic operations between a vector and a scalar and they will be applied element-wise, so computing the absolute value of nulls - mean(x) is easy. The abs function also takes vectors as arguments.
Logical operators (such as < and >) can be used in the same way, making it equally simple to compare the result with 1. This yields a vector of booleans (TRUE/FALSE) where TRUE means the value at that index was indeed greater than 1. Booleans are really just numbers (1 or 0), so you can sum that vector to count the TRUE elements, or take its mean to get their proportion directly.
I don't know what programming level you are on, but I hope this helps without giving the solution away completely (since you said it's for an assignment).
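As a toy illustration of those two ideas, with made-up numbers rather than the assignment data:
v <- c(0.5, 1.5, -2.0, 0.2)   # invented values, not the mouse weights
abs(v) > 1                    # element-wise: FALSE TRUE TRUE FALSE
sum(abs(v) > 1)               # count of TRUEs: 2
mean(abs(v) > 1)              # proportion of TRUEs: 0.5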
I am supposed to find the mean and standard deviation at each given sample size (N), using a "for loop". I started writing the code below; I am required to save all the means into the vector "p". How do I save all the means into one vector?
sample.sizes =c(3,10,50,100,500,1000)
mean.sds = numeric(0)
for ( N in sample.sizes ){
x <- rnorm(3,mean=0,sd=1)
mean.sds[i]
}
mean(x)
Actually, you are doing several things wrong.
You define the variable N in the for loop, but you never use it anywhere.
for (N in some_vector) means that N takes the values of that vector one by one: N in sample.sizes will first be 3, then 10, then 50, and so on.
Now, where does i come into the picture?
You calculate x in each iteration, but with a hard-coded 3 rather than N, so N is never used inside the loop.
So x always gets 3 values. In the next line you intend to store these three values in just the ith element of mean.sds, where i is undefined, and storing three values into one element, as it is, is not logically possible.
Do you want this?
sample.sizes <- c(3, 10, 50, 100, 500, 1000)
mean.sds <- numeric(0)
for (i in seq_along(sample.sizes)) {
  x <- rnorm(sample.sizes[i], mean = 0, sd = 1)
  mean.sds[i] <- mean(x)
}
mean.sds
[1] 0.6085489531 -0.1547286299 0.0052106559 -0.0452804986 -0.0374094936 0.0005667246
I replaced N with i in seq_along(sample.sizes), which gives one iteration per element of that vector: six in this example.
I passed the ith sample size as the first argument of rnorm to generate that many random values.
Those random values are stored in the single vector x; its mean (one value only) is then stored in the ith element of your initially empty vector.
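Since the original question also asks for the standard deviation at each sample size, a minimal extension might look like this (the vector names means and sds are my own choice):
sample.sizes <- c(3, 10, 50, 100, 500, 1000)
means <- numeric(length(sample.sizes))   # one slot per sample size
sds   <- numeric(length(sample.sizes))
for (i in seq_along(sample.sizes)) {
  x <- rnorm(sample.sizes[i], mean = 0, sd = 1)
  means[i] <- mean(x)   # sample mean at this N
  sds[i]   <- sd(x)     # sample standard deviation at this N
}
means
sds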
I want to make a new matrix B from a previous matrix A, where B has the same number of rows and columns as A and every position corresponds to a ranking of A.
In particular, for the value x at a location [i,j] in A, I want to find how many values are greater than x (i.e., sum(A > x), which I can compute for a single x, but not for every position at once), and then divide by the total number of observations * variables in the matrix A.
I think the apply function would be able to create matrix B as I wish, but I'm having trouble finding a way to apply sum for each position (i.e., sum(A > x) / # of positions in A).
I think I could use apply(A, c(1,2), FUN), but I do not know what function to use.
Thanks for any suggestions.
Short version: matrix((length(M) - rank(M))/length(M), nrow=nrow(M), ncol=ncol(M))
Long version:
length(M) will give you the number of elements in the matrix.
length(M) - rank(M) will give the number of elements greater than each element (exactly so when the values are distinct; rank averages ties by default).
So you want (length(M) - rank(M)) / length(M) but formatted into a matrix like M, so
matrix((length(M) - rank(M))/length(M), nrow=nrow(M), ncol=ncol(M))
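As a quick sanity check on a tiny made-up matrix (this works because rank and matrix both use the same column-major order, so each position is preserved):
M <- matrix(c(3, 1, 4, 2), nrow = 2)   # invented values
# 4 has 0 larger elements, 3 has 1, 2 has 2, 1 has 3
matrix((length(M) - rank(M))/length(M), nrow = nrow(M), ncol = ncol(M))
#      [,1] [,2]
# [1,] 0.25 0.00
# [2,] 0.75 0.50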
I am having trouble writing the proper R code to perform a looped if-else conditional test. I am trying to solve for x (which must be a whole number) such that F_c = 5 (see below). Both z and w are series of known values, with z representing abundance values and w representing the area sampled. Right now I am essentially entering random values for x to see how close I can get to F_c = 5. I would like to write a loop for this, and have the loop stop when an iteration of x results in F_c = 5. Any help would be very appreciated; I have spent a lot of time on this and haven't found a similar question posted yet (but if there is one, please direct me to it). Thanks,
cond <- ifelse(z <= x, 1, 0)
F_c <- 100 * sum(w * z * cond) / sum(w * z)
It's not very clear, but I'd assume you want to know at which point the cumulative sum of w*z reaches five per cent of sum(w*z), while following the order of z. If that's correct, you can try this:
# for every z, get the ordering indices
indices <- order(z)
# sort both z and w by z
z <- z[indices]
w <- w[indices]
# cumsum gives the cumulative sum of a vector;
# dividing by sum(w*z) turns it into cumulative fractions.
# findInterval gives the index at which you reach .05
res <- findInterval(.05, cumsum(w * z) / sum(w * z))
# the value you are looking for:
z[res]
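To see it end to end on tiny made-up data (z and w are invented here; note that an x giving exactly F_c = 5 may not exist, so this finds the closest whole number from below):
z <- 1:10                              # hypothetical abundance values, already sorted
w <- c(2, 1, 3, 2, 1, 2, 3, 1, 2, 2)   # hypothetical sampled areas
res <- findInterval(.05, cumsum(w * z) / sum(w * z))   # 2 cumulative fractions are <= .05
x <- z[res]                                            # x = 2
100 * sum(w * z * (z <= x)) / sum(w * z)               # F_c of about 3.81, the largest not above 5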
I have a matrix with 100 rows and 10 columns. I want to compare the dissimilarity between rows, sort them, and then select the 10 most dissimilar rows. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial method is to calculate the similarity (e.g., the Tanimoto coefficient or others: http://en.wikipedia.org/wiki/Jaccard_index) between two rows, take dissimilarity = 1 - similarity, and then compare the dissimilarity values. Finally I would sort all dissimilarity values and select the 10 maximum ones. But it seems that the result is a 100 * 100 matrix, and a more efficient method may be needed if there are a large number of rows. However, this is just my thought, maybe not right, so I need help.
[update]
After looking through some literature, I found a definition of the maximum dissimilarity method.
Maximum dissimilarity method: it begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats itself until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.
First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the index in question is defined in terms of set intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares; that works fine
# for us, since the ordering is still the same
dists <- apply(mat, 1, function(row) sum((row - center)^2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids, ]
If I were actually writing this, I probably would have compressed the last three lines to most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
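If you do want the farthest-point procedure described in the update instead, here is a minimal sketch, assuming Euclidean distance and reading "maximally distant from the current points" as maximizing the minimum distance to the rows chosen so far:
# sketch of maximum dissimilarity (farthest-point) selection; the
# Euclidean metric and the max-min reading are assumptions
max.dissim.select <- function(mat, k) {
  d <- as.matrix(dist(mat))        # pairwise Euclidean distances between rows
  chosen <- sample(nrow(mat), 1)   # random first cluster center
  while (length(chosen) < k) {
    # minimum distance from each row to the rows chosen so far
    min.d <- apply(d[, chosen, drop = FALSE], 1, min)
    min.d[chosen] <- -Inf          # never re-pick an already-chosen row
    chosen <- c(chosen, which.max(min.d))
  }
  chosen
}
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
mat[max.dissim.select(mat, 10), ]   # the 10 selected rows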