Problem with extracting values from a vector for a for loop - Julia

I am trying to extract values from a vector to generate random numbers from a GEV distribution, but I keep getting an error. This is my code:
x=rand(Truncated(Poisson(2),0,10),10)
t=[]
for i in 1:10 append!(t, maximum(rand(GeneralizedExtremeValue(2,4,3, x[i])))
I am new to this language and I think I am not passing the variable x properly. Any help will be appreciated. Thanks

If I am correctly understanding what you are trying to do, you might want something more like
using Distributions

x = rand(Truncated(Poisson(2), 0, 10), 10)
t = Float64[]
for i in 1:10
    append!(t, max(rand(GeneralizedExtremeValue(2, 4, 3)), x[i]))
end
Among other things, you were missing a paren, and probably want max instead of maximum here.
Also, while it would technically work, t = [] creates an empty array of type Any, which tends to be very inefficient, so you can avoid that by just telling Julia what type you want that array to hold with e.g. t = Float64[].
Finally, since you already know t only needs to hold ten results, you can make this more efficient still by pre-allocating t:
x = rand(Truncated(Poisson(2), 0, 10), 10)
t = Array{Float64}(undef, 10)
for i in 1:10
    t[i] = max(rand(GeneralizedExtremeValue(2, 4, 3)), x[i])
end
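As an aside, a comprehension can express the same computation without an explicit loop or pre-allocation. This is just a sketch of the same logic, assuming the Distributions package is loaded:
using Distributions

x = rand(Truncated(Poisson(2), 0, 10), 10)
# One draw per element of x, keeping the larger of the GEV draw and x[i]
t = [max(rand(GeneralizedExtremeValue(2, 4, 3)), xi) for xi in x]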

Related

for loop function in Seurat analysis in R

I am trying to get multiple outputs from the function I made:
ratio_marker_out_2 = function(marker_gene, cluster_id){
  marker_gene = list(row.names(FindMarkers(glioblastoma, ident.1 = cluster_id)))
  for (gene in marker_gene){
    all_cells_all_markers = glioblastoma@assays$RNA@counts[gene,]
    selected_cells_all_marker = all_cells_all_markers[cluster_id != Idents(glioblastoma)]
    gene_count_out_cluster = glioblastoma@assays$RNA@counts[, cluster_id != Idents(glioblastoma)]
    ratio_out = sum(selected_cells_all_marker)/sum(gene_count_out_cluster)
  }
  return(ratio_out)
}
Here, the length of marker_gene is in the hundreds; let's say it is 100. I want to get a ratio_out for each gene in marker_gene. However, when running this function, I only get one output instead of a list of 100 ratio_out values. Could anyone please help me fix this?
The output I got for
ratio_marker_out_2(marker_gene, 0)
is [1] 0.5354895. Please see the picture below.
It may be caused by the sum built-in function.
By default, it returns a single number. So when you do:
ratio_out = sum(selected_cells_all_marker)/sum(gene_count_out_cluster)
you're actually dividing two single numbers.
So if you want to return a vector instead, then, depending on your calculations, you should drop the sum around the numerator:
ratio_out = (selected_cells_all_marker)/sum(gene_count_out_cluster)
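For illustration, here is a sketch of how the whole computation could return one ratio per marker gene by mapping over the genes with sapply instead of a for loop. This reuses glioblastoma and cluster_id from the question; ratio_marker_out_per_gene is a hypothetical name, and a Seurat v3/v4-style object is assumed:
library(Seurat)

ratio_marker_out_per_gene = function(obj, cluster_id){
  marker_genes = row.names(FindMarkers(obj, ident.1 = cluster_id))
  out_of_cluster = cluster_id != Idents(obj)
  # Denominator: total counts over all genes in the out-of-cluster cells
  total_out = sum(obj@assays$RNA@counts[, out_of_cluster])
  # One ratio per marker gene: its out-of-cluster counts over the total
  sapply(marker_genes, function(gene){
    sum(obj@assays$RNA@counts[gene, out_of_cluster]) / total_out
  })
}

ratios = ratio_marker_out_per_gene(glioblastoma, 0)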
I have solved this issue using
all_cells_all_markers[marker_gene, cluster_id!=Idents(glioblastoma)]
ratio_out = (selected_cells_all_marker)/sum(gene_count_out_cluster)

Get the mapping from each element of input to the bin of the histogram in Julia

Matlab's [n, mapx] = histc(x, bin_edges) returns the counts of x in each bin as n, and also returns mapx, a vector of the same length as x holding the bin index that each element of x was placed into.
I can do the same thing in Julia as follows:
using StatsBase
x = rand(1000)
bin_e = 0:0.1:1
h = fit(Histogram, x, bin_e)
yx = map((z) -> findnext(z.<=h.edges[1],1),x) .- 1
Is this the "right way" to do this? It seems a bit kludgy.
Inspired by this Python question, you should be able to define a small function that delivers the desired mapping (modulo conventions):
binindices(edges, data) = searchsortedlast.(Ref(edges), data)
Note that the bin edges are sorted, so we can use searchsortedlast to get the last bin edge smaller than or equal to a given datapoint. Broadcasting this over all of the data, we obtain the mapping. Note that Ref(edges) marks edges as a scalar under broadcasting (that is, the full edges array is considered in each call).
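For example, with the edges from the question (the data values here are made up for illustration):
# definition from above
binindices(edges, data) = searchsortedlast.(Ref(edges), data)

binindices(0:0.1:1, [0.05, 0.15, 0.95])  # returns [1, 2, 10]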
Although conceptually identical to your solution, this approach is about 13x faster on my machine.
I filed an issue over at StatsBase.jl's GitHub page suggesting adding this as a feature.
After looking through the histogram code in StatsBase.jl, I found that it already includes a binindex function. So this solution is probably the best:
x = 0:0.001:10
h1 = fit(Histogram, x, 0:10, closed=:left)
xmap1 = StatsBase.binindex.(Ref(h1), x)
h2 = fit(Histogram, x, 0:10, closed=:right)
xmap2 = StatsBase.binindex.(Ref(h2), x)
I stumbled across this question when I was trying to figure out how many occurrences of each value I had in a list of values. If each value is in its own bin (as for categorical data, or integer data with a small number of unique values), this is what one would be plotting in a histogram.
If that is what you want, then countmap() in the StatsBase package is just what you need.
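For example (illustrative values):
using StatsBase

countmap([1, 2, 2, 3, 3, 3])  # Dict(2 => 2, 3 => 3, 1 => 1)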

Making a looping statement that populates a vector?

I've tried a couple of ways of doing this problem but am having trouble with how to write it. I think I did the first three steps correctly, but now I have to fill the vector z with the numbers from y that are divisible by four, not divisible by three, and have an odd number of digits. I know that I'm using the print function in the wrong way; I'm just at a loss on what else to use ...
This is different from that other question because I'm not using a while loop.
#Step 1: Generate 1,000,000 random, uniformly distributed numbers between 0
#and 1,000,000,000, and name as a vector x. With a seed of 1.
set.seed(1)
x=runif(1000000, min=0, max=1000000000)
#Step 2: Generate a rounded version of x with the name y
y=round(x,digits=0)
#Step 3: Empty vector named z
z=vector("numeric",length=0)
#Step 4: Create for loop that populates z vector with the numbers from y that are divisible by
#4, not divisible by 3, with an odd number of digits.
for(i in y) {
  if(i %% 4 == 0 && i %% 3 != 0 && nchar(i, type="chars", allowNA=FALSE, keepNA=NA) %% 2 != 0){
    print(z, i)
  }
}
NOTE: As per @BenBolker's comment, a loop is an inefficient way to solve your problem here. Generally, in R, try to avoid loops where possible to maximise the efficiency of your code; @SymbolixAU has provided an example of doing so in the comments (a vectorized version along those lines is sketched at the end of this answer). Having said that, in aid of helping you learn the ins and outs of loops and vectors, here's a solution which only requires a change to one line of your code:
You've got the vector created before the loop; that's a good start. Now, inside your loop, you need to populate that vector. To do so, you've currently got print(z, i), which won't really do too much. What you need is to change the vector itself:
z <- c( z, i )
That should work for you (just replace the print line in your loop with it).
What's happening here is that we're taking the existing z vector, binding i to the end of it, and making that new vector z again. So every time a value is added, the vector gets a little longer, such that you'll end up with a complete vector.
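For completeness, a fully vectorized version in the spirit of the comments mentioned above could look like the following (this exact code is my own sketch, not necessarily the one from the comments):
set.seed(1)
x <- runif(1000000, min = 0, max = 1000000000)
y <- round(x, digits = 0)
# Digit count via log10 (avoids as.character() switching to scientific notation)
n_digits <- floor(log10(pmax(y, 1))) + 1
# Divisible by 4, not divisible by 3, odd number of digits
z <- y[y %% 4 == 0 & y %% 3 != 0 & n_digits %% 2 == 1]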
Where you have print, put this instead:
z <- append(z, i)

Fixing a race condition in a TensorFlow run

I would like to maintain a variable on the GPU, and perform some operations on that variable in place. The following snippet is a minimalish example of this.
import numpy as np
import tensorflow as tf
with tf.Graph().as_default():
    i = tf.placeholder(tf.int32, [4], name='i')
    y = tf.placeholder(tf.float32, [4], name='y')
    _x = tf.get_variable('x', [4], initializer=tf.random_normal_initializer())
    x = _x + tf.reduce_sum(tf.mul(_x, y))
    assign_op = tf.assign(_x, x).op
    permute_op = tf.assign(_x, tf.gather(_x, i))

    ii = np.array([1, 2, 3, 0])
    yy = np.random.randn(4)

    s = tf.Session()
    s.run(tf.initialize_all_variables())
    xxx0 = s.run(_x)
    s.run([permute_op, assign_op], feed_dict={i: ii, y: yy})
    xxx1 = s.run(_x)
    print('assigned then permuted', np.allclose((xxx0 + np.dot(xxx0, yy))[ii], xxx1))
    print('permuted then assigned', np.allclose(xxx0[ii] + np.dot(xxx0[ii], yy), xxx1))
The problem is that this program is ambiguous in terms of the ordering of the assign_op and permute_op operations. Hence, one or the other of the final two print statements will be true, but which one it is varies randomly across multiple runs of the program. I could break this into two steps, the first running the permute_op and the second running the assign_op, but it seems this would be less efficient.
Is there an efficient way of breaking the race condition, and making the results predictable?
The easiest way to order the two assignments is to use the result of the first assignment as the variable input to the second one. This creates a data dependency between the assignments, which gives them a deterministic order. For example:
assigned = tf.assign(_x, x)
permuted = tf.assign(assigned, tf.gather(assigned, i))
sess.run(permuted.op) # Runs both assignments.
Note that I reversed the order of the permutation and assignment operations from what you said in your question, because doing the permutation first and then updating still has a race. Even if these aren't the semantics you wanted, the principle should hopefully be clear.
An alternative approach is to use with tf.control_dependencies(ops): blocks, where ops is a list of operations (such as assignments) that must run before the operations in the with block. This is slightly trickier to use, because you have to be careful about reading the updated value of a variable. (Like a non-volatile variable in C, the read may be cached.) The typical idiom to force a read is to use tf.identity(var.ref()), so the example would look something like:
assign_op = tf.assign(_x, x).op
with tf.control_dependencies([assign_op]):
    # Read the updated value of `_x` after `assign_op` has run.
    new_perm = tf.gather(tf.identity(_x.ref()), i)
    permute_op = tf.assign(_x, new_perm).op

sess.run(permute_op)  # Runs both assignments.
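As a quick sanity check (a sketch reusing the variable names from the question), re-running the question's comparison against this ordered version should now print True for 'assigned then permuted' on every run:
sess.run(tf.initialize_all_variables())
xxx0 = sess.run(_x)
sess.run(permute_op, feed_dict={i: ii, y: yy})
xxx1 = sess.run(_x)
# With the dependency in place, the assignment always runs before the permutation
print('assigned then permuted', np.allclose((xxx0 + np.dot(xxx0, yy))[ii], xxx1))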

Double "for loops" in a dataframe in R

I need to do quality control on a dataset with more than 3000 variables (columns). However, I only want to apply some conditions to a couple of them. A first step would be to replace outliers with NA: I want to replace observations that are more than 3 standard deviations from the mean with NA. I managed it doing it column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1 = names(data)
list = c("age", "height", "mark")
ntraits = length(list)
nrows = dim(data)[1]
for(i in 1:ntraits){
  a = list[i]
  b = which(d1 == a)
  d2 = data[, b]
  for (j in 1:nrows){
    d2[j] = ifelse(abs(d2[j] - mean(d2, na.rm=TRUE)) < 3*sd(d2, na.rm=TRUE), d2[j], NA)
  }
}
Someone told me that I am not storing d2. How can I write for loops to apply the conditions I want? I know that there are similar questions, but I haven't got it yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm=TRUE)
  sd_x <- sd(x, na.rm=TRUE)
  # TRUE only for non-NA values more than 3 sd from the mean
  x_outside_3s <- !is.na(x) & abs(x - mean_x) > 3 * sd_x
  x[x_outside_3s] <- NA # no need for ifelse here
  x
}
Of course, you can choose any function name you want; more descriptive is better.
Then, if you want to apply the function to every column, just loop over the columns. That function NA_outside_3s is already vectorized, i.e. it takes a numeric vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or some subset of columns
for (j in cols_to_loop_over) {
  my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, an R built-in and a fairly important one at that. You also generally shouldn't name variables data, because there is also a data() function for loading built-in data sets.
