I am building a TensorFlow graph, and the final step is a simple mean squared error calculation. Some of the inputs are padding, which I am masking out using the tensor loss_mask. This tensor is mostly 1's, with a few 0's at the end.
The following operation calculates the mean squared error, where the masked-out inputs count as zero error:
error_squared = loss_mask * tf.pow(expected_outputs_reshaped - network_outputs, 2)
loss = tf.reduce_mean(error_squared)
The problem is that, if many outputs are masked out, the average loss seems much lower than it should be, since the masked-out inputs count as zero error. So I tried the following code to correct for this:
error_squared = loss_mask * tf.pow(expected_outputs_reshaped - network_outputs, 2)
loss = tf.reduce_sum(error_squared) / tf.reduce_sum(loss_mask)
If loss_mask is all 1's, these two expressions should produce the same value. However, they do not. The first gives me correct values, all between 0 and 1. The second gives me values in the hundreds. Why are these two not the same?
I have verified that tf.reduce_sum(loss_mask) equals the number of elements in error_squared.
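For reference, here is a minimal NumPy sketch of the algebra, with made-up shapes and data: when the shapes match exactly and the mask is all ones, the two reductions agree, and a silent broadcast between the mask and the squared error is one common way sums like these get inflated.
import numpy as np

# Illustrative shapes only; names mirror the question but the data is random.
expected = np.random.rand(8, 1).astype(np.float32)
network = np.random.rand(8, 1).astype(np.float32)
mask = np.ones((8, 1), dtype=np.float32)       # all ones, nothing masked

error_squared = mask * (expected - network) ** 2
loss_a = error_squared.mean()                  # tf.reduce_mean equivalent
loss_b = error_squared.sum() / mask.sum()      # tf.reduce_sum / sum of mask
print(np.isclose(loss_a, loss_b))              # True when the shapes match exactly

# If mask had shape (8,) while the error had shape (8, 1), the product would
# silently broadcast to (8, 8) and the sums would be inflated.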
I would like to know why I'm getting this error when running metaMDS:
'comm' has negative data: 'autotransform', 'noshare' and 'wascores' set to FALSE
I would like to do NMDS and dendrogram graphs but cannot do so with the error above.
My data set is available for download if anyone wants to check: DATASET. After importing the data, I transposed the columns and rows, and then replaced the NA values with 0 before trying to run metaMDS.
abundance <- read.table("1_abundance.txt", header = TRUE)
abundance[is.na(abundance)] <- 0
abundance_trans <- t(abundance)
metaMDS(abundance_trans, distance = "bray", k = 2, trymax = 50)
It is not an error message but information: metaMDS tells you that you have negative data entries, and it will not perform some of the tricks it defaults to with non-negative data.
The second issue is that you ask for Bray-Curtis dissimilarities, which are only applicable to non-negative data.
You have two alternatives: either take care of the negative values, or use a dissimilarity measure that can handle them. If you think that you do not have negative data, you are wrong: the computer knows. You may have an error in reading in your data, and you may have columns or rows that you should not have. Check your data.
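If it helps to see exactly where the negative entries are, here is a rough check sketched in Python/pandas (the file name comes from the question; the whitespace separator and row-name column are assumptions), though the same check can of course be done directly in R:
import pandas as pd

# Read the abundance table (assuming whitespace-separated values with row
# names in the first column, mirroring read.table(..., header = TRUE)).
abundance = pd.read_csv("1_abundance.txt", sep=r"\s+", index_col=0)

# List every (row, column) cell that is negative so it can be traced back to
# the original file; an empty result means there are no negative entries.
flat = abundance.stack()
print(flat[flat < 0])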
Given is a function F1:
F1 <- function(C1,C2,C3,...,x,u_target) {
# a lot of equations follow
...
u_actual - u_target
}
F1 returns the result of the very last equation
u_actual - u_target
I want to determine the value for the parameter x in a way that the result of the last equation converges to zero. With
nlm(f=F1,p=c(0),C1=C1,C2=C2,...,stepmax=0.001,ndigit=8)
I get a result, but not a satisfying one:
u_actual = 0.1316566
u_target = 0.1
I played around a lot with the arguments of the nlm command (gradtol, stepmax, iterlim etc.), but I was not able to get a better result. I also tried optim, optimize and uniroot, but was not able to get them to run at all.
u depends on x roughly like a negative exponential: as x decreases, u increases exponentially, and at x = 0 u still takes a finite value. x also has an upper boundary, which is unknown. So I guessed it would be promising to start the iteration at the lower boundary (zero) and increase step by step. However, whether I decrease or increase the value of stepmax, the result does not get better.
I would appreciate any hint from the r-community.
Thank you very much.
PS: in MATLAB a colleague uses fsolve(@(x) F1(x,u_target,C1,C2,...),0), and it works fine.
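For context, what the fsolve call does is plain root finding on the residual. Below is a minimal sketch of that pattern in Python, with a made-up stand-in for F1 since the real equations are not shown; uniroot in R follows the same bracketed-interval idea.
import math
from scipy.optimize import brentq

# Toy stand-in for F1: u decays with x and is finite at x = 0, as described.
def residual(x, u_target=0.1):
    u_actual = 0.5 * math.exp(-3.0 * x)   # placeholder for the real model
    return u_actual - u_target

# Bracket the root between x = 0 (residual positive) and an upper bound where
# the residual has changed sign, then solve residual(x) = 0.
x_root = brentq(residual, 0.0, 5.0)
print(x_root, residual(x_root))           # residual is ~0 at the solution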
I have a question about a result which I did not expect when doing PCA.
I have successfully calculated the principal components using reference data, and then, as a check to ensure that what's going on is what I think is going on, I've projected the reference data onto the entire basis of its eigenfunctions (kept all components) and then transformed back (this is in Python, so it's pca.fit(ref_data), followed by ref_data_transform = pca.transform(ref_data), followed by pca.inverse_transform(ref_data_transform)), and I get the exact same data. This is not a surprise.
What is also not a surprise is that as I choose fewer and fewer principal components, the point-to-point difference between the original data and the data that has been projected onto a smaller basis and then projected back increases. That is, if you plot the original data and the "filtered" data, they look different, with the difference increasing as you reduce the size of the subspace onto which you're projecting. I can capture the difference between each pair of data points in a vector called, say, difference_vec.
What IS a surprise (to me at least) is that when I sum over any column of difference_vec it always equals zero. That is, while the actual differences between any original data point and the corresponding one filtered by some number of principal components grow larger as I project onto a smaller and smaller subspace, the TOTAL error is always zero.
I would very much appreciate any insight into whether I'm making some mistake here and, if not, why this would-be "projection-induced error" metric doesn't work.
Thanks.
This happens because ref_data and what I’ll call inv_data = pca.inverse_transform(pca.transform(ref_data)) both have the same mean (the per-feature mean, i.e., averaging over samples).
To see this, take a look at the code for transform:
transform = lambda X: np.dot(X - mu, V.T)
whereas inverse_transform can be defined as:
inverse_transform = lambda X: np.dot(X, V) + mu
where mu is the mean of ref_data and V are the first N eigenvectors of covariance(ref_data).
So if you follow the chain of data and its mean:
ref_data with mean mu;
transform(ref_data) has mean 0 (see the equivalent definition above: X - mu has zero mean, and projecting the result linearly onto some coordinate basis only rotates/shears/flips those zero-mean points, so it doesn’t alter their mean);
Finally, inv_data = inverse_transform(transform(ref_data)) adds mu back so it has mu-mean;
you see that ref_data and inv_data both have mean mu.
So sum(ref_data - inv_data) can be seen as sum(mean(ref_data - inv_data) * num_samples), which by linearity simplifies to sum(mu - mu), which is 0.
That’s a lot of words, sorry, but the idea, now that I see it, is really simple. As I mentioned in my comment, in cases like this you want to use a matrix norm, like the Frobenius norm, to measure the distance between two matrices, not just sum(A - B) 😅!
Sample code:
import numpy as np
from sklearn.decomposition import PCA
ref_data = np.random.randn(20, 3)
pca = PCA(n_components=1)
pca.fit(ref_data)
trans_data = pca.transform(ref_data)
inv_data = pca.inverse_transform(trans_data)
np.mean(inv_data, 0) # array([ 0.03664149, 0.51348007, 0.0360179 ])
np.mean(ref_data, 0) # array([ 0.03664149, 0.51348007, 0.0360179 ])
np.mean(trans_data, 0) # array([ -2.49800181e-17]) meanwhile ...
np.sum(inv_data - ref_data) # -1.3877787807814457e-15 !
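Continuing the same toy example, a matrix norm makes the reconstruction error visible even though the elementwise sum cancels:
np.linalg.norm(ref_data - inv_data, ord='fro') # clearly nonzero once components are dropped
np.sum(ref_data - inv_data) # still ~0, because the column means cancel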
I am attempting to use predict with a loess object in R. There are 112406 observations. There is one particular line inside stats:::predLoess which attempts to multiply N*M1, where N=M1=112406. This causes an integer overflow and the function bombs out. The line of code that does this is the following (copied from the predLoess source):
L <- .C(R_loess_ise, as.double(y), as.double(x), as.double(x.evaluate[inside,
]), as.double(weights), as.double(span), as.integer(degree),
as.integer(nonparametric), as.integer(order.drop.sqr), as.integer(sum.drop.sqr),
as.double(span * cell), as.integer(D), as.integer(N), as.integer(M1),
double(M1), L = double(N * M1))$L
Has anyone found a solution to this problem? I am using R 2.13. The name of this forum is fitting for this problem.
It sounds like you're trying to get predictions for all N=112406 observations. First, do you really need to do this? For example, if you want graphical output, it's faster just to get predictions on a small grid over the range of your data.
If you do need 112406 predictions, you can split your data into subsets (say of size 1000 each) and get predictions on each subset independently. This avoids forming a single gigantic matrix inside predLoess.
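The splitting idea itself is not R-specific. Here is a rough sketch of the pattern in Python, with a placeholder predict function standing in for R's predict(); in R the equivalent is a loop over subsets of the new data, calling predict() on each subset and combining the results.
import numpy as np

def predict_in_chunks(predict_fn, newdata, chunk_size=1000):
    # Apply the model's predict function to successive slices of newdata and
    # concatenate the results, so no single N-by-M working matrix is formed.
    parts = [predict_fn(newdata[start:start + chunk_size])
             for start in range(0, len(newdata), chunk_size)]
    return np.concatenate(parts)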
Consider a vector V riddled with noisy elements. What would be the fastest (or any) way to find a reasonable maximum element?
For e.g.,
V = [1 2 3 4 100 1000]
rmax = 4;
I was thinking of sorting the elements and finding the second difference, i.e. diff(diff(unique(V))).
EDIT: Sorry about the delay.
I can't post any representative data since it contains 6.15e5 elements. But here's a plot of the sorted elements.
By just looking at the plot, a piecewise linear function may work.
Anyway, regarding my previous conjecture about using differentials, here's a plot of diff(sort(V));
I hope it's clearer now.
EDIT: Just to be clear, the desired "maximum" value would be the value right before the step in the plot of the sorted elements.
NEW ANSWER:
Based on your plot of the sorted amplitudes, your diff(sort(V)) algorithm would probably work well. You would simply have to pick a threshold for what constitutes "too large" a difference between the sorted values. The first point in your diff(sort(V)) vector that exceeds that threshold is then used to get the threshold to use for V. For example:
diffThreshold = 2e5; % Pick threshold for "too large" a jump between sorted values
sortedVector = sort(V); % Sort the amplitude data
index = find(diff(sortedVector) > diffThreshold,1,'first'); % Find the first large jump
signalThreshold = sortedVector(index); % Value just before the step
Another alternative, if you're interested in toying with it, is to bin your data using HISTC. You would end up with groups of highly-populated bins at both low and high amplitudes, with sparsely-populated bins in between. It would then be a matter of deciding which bins you count as part of the low-amplitude group (such as the first group of bins that contain at least X counts). For example:
binEdges = min(V):1e7:max(V); % Create vector of bin edges
n = histc(V,binEdges); % Bin amplitude data
binThreshold = 100; % Pick threshold for number of elements in bin
index = find(n < binThreshold,1,'first'); % Find first bin whose count is low
signalThreshold = binEdges(index);
OLD ANSWER (for posterity):
Finding a "reasonable maximum element" is wholly dependent upon your definition of reasonable. There are many ways you could define a point as an outlier, such as simply picking a set of thresholds and ignoring everything outside of what you define as "reasonable". Assuming your data has a normal-ish distribution, you could probably use a simple data-driven thresholding approach for removing outliers from a vector V using the functions MEAN and STD:
nDevs = 2; % The number of standard deviations to use as a threshold
index = abs(V-mean(V)) <= nDevs*std(V); % Index of "reasonable" values
maxValue = max(V(index)); % Maximum of "reasonable" values
I would not sort and then difference. If you have some reason to expect continuity or bounded change (e.g., the vector consists of consecutive sensor readings), then sorting will destroy the time information (or whatever the vector index represents). Filtering by detecting large spikes isn't a bad idea, but you would want to compare each spike to a larger neighborhood (the 2nd difference effectively has you looking within a window of ±2).
You need to formally describe the expected information in the vector and the type of noise.
You need to know the frequency and distribution of errors and non-errors. In the simplest model, the elements in your vector are independent and identically distributed, and errors are all-or-none (each element randomly stores either the true value or an error). You should then be able to figure out, for each element, the chance that it's accurate versus the chance that it's noise. This could be very easy (error values always fall in a range that doesn't overlap with non-error values), or very hard.
To simplify: don't make any assumptions about what kind of data an error produces (the worst case is that you can't rule out any of the error data points as ridiculous, but they all lie at or above the maximum of the non-error measurements). Then, if the probability of error is p and your vector has n elements, the chance that the kth highest element in the vector is less than or equal to the true maximum is given by the cumulative binomial distribution - http://en.wikipedia.org/wiki/Binomial_distribution
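As a rough illustration of that calculation (the error rate p is a made-up number here; n roughly matches the 6.15e5 elements mentioned in the question):
from scipy.stats import binom

n = 615000        # number of elements in the vector
p = 1e-5          # assumed per-element probability of an error (illustrative)

# Under the worst-case model above, the kth highest element is <= the true
# maximum exactly when at most k-1 elements are errors.
for k in (1, 5, 10, 20):
    print(k, binom.cdf(k - 1, n, p))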
First, pick your favorite method for identifying outliers...
If you expect the numbers to come from a normal distribution, you can use, say, 2 standard deviations above the mean to determine your max.
Do you have access to bounds on your noise-free elements? For example, do you know that your noise-free elements are between -10 and 10?
In that case, you could remove the noise first and then find the max:
max( v( find(v<=10 & v>=-10) ) )