I'm following an Econometrics course in which we use R.
I'm a bit confused about the use of the while-loop for an exercise.
The situation is as follows:
I have a variable 'profit_loss' with a length of 10,000. The variable gives an overview of the financial performance of a company with its profits (positive values) and losses (negative values). From this I have to create another variable for which I have to create a minimum benchmark such that the mean of this variable becomes 20% higher than the mean of the original variable. In other words, I have to cut the losses to a certain point, which will increase the mean to a level where it's 20% higher than the original mean.
I was thinking of using a while-loop, but although I can run some simple loops, I can't seem to figure out this one.
Any suggestions?
Thank you so much!
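One way to read the exercise, sketched under stated assumptions rather than as the course's intended solution: raise a floor (the "minimum benchmark"), replace every value below it with the floor, and keep adjusting the floor in a while-loop until the mean is 20% above the original. The sketch is in Python purely for illustration; the loop maps directly onto an R `while` loop. The function name `floor_for_target_mean`, the simulated `profit_loss` data, and the assumption that the original mean is positive are all mine, not from the question.

```python
import random

def floor_for_target_mean(values, uplift=0.20, tol=1e-9):
    """Find a floor c so that mean(max(v, c)) is (1 + uplift) * mean(values).

    Assumes the original mean is positive; otherwise "20% higher" is ambiguous.
    """
    base = sum(values) / len(values)
    if base <= 0:
        raise ValueError("this sketch assumes a positive original mean")
    target = base * (1 + uplift)

    # The floored mean never decreases as the floor rises, so bisection works.
    # At c = min(values) the mean is unchanged; at c = target it is >= target.
    lo, hi = min(values), target
    while hi - lo > tol:
        mid = (lo + hi) / 2
        floored_mean = sum(max(v, mid) for v in values) / len(values)
        if floored_mean < target:
            lo = mid   # floor too low, mean still below target
        else:
            hi = mid   # floor high enough, try a lower one
    return hi

# Hypothetical stand-in for the 10,000 profit/loss observations.
random.seed(0)
profit_loss = [random.gauss(5, 20) for _ in range(10_000)]

benchmark = floor_for_target_mean(profit_loss)
floored = [max(v, benchmark) for v in profit_loss]
```

A fixed-step loop (raise the floor by a small amount until the mean crosses the target) would also work; bisection just needs far fewer passes over the data.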
I am using R (RStudio) to construct an index/synthetic indicator to evaluate, say, commercial efficiency. I am using the PCA() command from the FactoMineR package with 7 distinct variables. I have previously created similar indexes by calculating the weight of each variable on the first component (which can be obtained through PCA()$var$coord[,1]), with no problems, since each variable had a positive weight. This time, however, one variable has a weight with an undesired sign: negative. The variable is ‘delivery speed’, and this sign would imply that the greater the speed, the less efficient the process. What is going on? How would you fix this, preferably still using PCA?
The sign of a variable's weight shouldn't matter in PCA. Taken together, all of the components represent the original data perfectly (when p < n), so it is natural for some components to have a mix of positive and negative weights. A negative weight doesn't mean the variable has an undesired weight; it only means that for that particular extracted signal (say, the first principal component) the variable's weight is negative.
For a better understanding, let's take the classical 2 dimensional example, which I took from this very useful discussion:
Can you see from the graph that one of the weights will necessarily be negative for the 2nd principal component?
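To make the two-dimensional case concrete, here is a minimal sketch in Python (the thread itself uses R; the simulated data and the helper name `principal_axes_2d` are assumptions for illustration). For two positively correlated variables, the first axis has two weights of the same sign, and because the second axis must be perpendicular to it, its two weights have opposite signs. Note also that the overall sign of any component is arbitrary: multiplying a whole weight vector by -1 gives an equally valid component.

```python
import math
import random

def principal_axes_2d(xs, ys):
    """Return the two principal directions of 2-D data as unit vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Angle of the direction of maximum variance for a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))  # perpendicular to pc1
    return pc1, pc2

# Hypothetical positively correlated data.
random.seed(1)
xs = [random.gauss(0, 1) for _ in range(500)]
ys = [0.8 * x + random.gauss(0, 0.5) for x in xs]

pc1, pc2 = principal_axes_2d(xs, ys)
print("PC1 weights:", pc1)
print("PC2 weights:", pc2)
```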
Finally, if that variable does actually disturb your analysis, one possible solution would be to apply Sparse PCA. Under cross-validated regularization that method is able to make some of the weights equal to zero. If in your case that negative weight is not significant enough, it might get reduced to zero under SPCA.
Given a data set with several records, that are similar to this one:
I want to detect the green dots. This pattern recurs in many of the data records but is not completely identical (sd, variance, min, max, etc.). These data points are near the minimum and show low variance.
I tried clustering (kmeans, dbscan, mclust) but the result was not very good.
How can I solve this problem? Any ideas?
Dare I say a simple threshold based on the minimum and a percentage?
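A minimal sketch of what such a threshold might look like, in Python. The exact rule is an assumption on my part: here a point is flagged when it lies within a given percentage of the record's range above its minimum. The function name `near_minimum` and the default of 10% are hypothetical and would need tuning per data set.

```python
def near_minimum(values, pct=0.10):
    """Return indices of points within pct of the range above the minimum."""
    lo, hi = min(values), max(values)
    cutoff = lo + pct * (hi - lo)
    return [i for i, v in enumerate(values) if v <= cutoff]

# Hypothetical record: three points hug the minimum, the rest are higher.
record = [5.0, 5.1, 9.0, 12.0, 5.05, 15.0]
flagged = near_minimum(record, pct=0.10)
```

Because the cutoff is recomputed from each record's own minimum and range, it adapts to records whose levels differ, which fixed cluster settings may not.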
There is a dataset which contains aggregated data - aggregated to various dimensions, and down to the hourly level. The main measure is speed which is simply the file size divided by the duration.
The requirement is to see Percentile, Median and Average/Mean summaries.
Mean is simple because we simply create a calculated measure in the MDX and then it works at all aggregation levels i.e. daily/monthly etc.
However, percentile and median are hard. Is there any way to define a calculation for these functions that will roll up correctly? We could add the percentile speed as a column in the ETL when we're reading the raw data, but we'd still need to find a way to roll it up further.
What is the proper way to roll up these types of measures? It's not uncommon to ask for percentile numbers, so I'm surprised to not see much information on this when I look around.
Maybe the only approach is to have various aggregated tables at the right level, with the right calculation, and then make Mondrian use them as agg tables? Or, worst case, have multiple cubes (!)
OK, so it turns out you cannot roll up percentiles (and therefore medians, since a median is just the 50th percentile). I understand others have had this problem; see this tweet from Kasper here: https://twitter.com/kaspersor/status/308189242788560896
So our solution was a couple of different agg tables to store the relevant stats, and on the main (already aggregated) fact table to store the pre-computed percentile and median stats.
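A small illustration of why this is so, sketched in Python with made-up hourly speeds (the numbers and names are hypothetical). A mean can be rebuilt from per-hour sums and counts, but the median of the hourly medians is generally not the median of the underlying values.

```python
from statistics import median

# Hypothetical raw speeds, grouped by hour.
hourly_speeds = {
    "09:00": [10, 12, 11],
    "10:00": [50, 55, 60, 65, 70],
}

# Attempted roll-up: take the median of the per-hour medians.
hourly_medians = [median(v) for v in hourly_speeds.values()]
median_of_medians = median(hourly_medians)          # 35.5

# Correct daily median needs the raw values.
all_speeds = [s for v in hourly_speeds.values() for s in v]
true_median = median(all_speeds)                    # 52.5

# The mean does roll up, provided sums and counts are kept per hour.
total = sum(sum(v) for v in hourly_speeds.values())
count = sum(len(v) for v in hourly_speeds.values())
rolled_up_mean = total / count
true_mean = sum(all_speeds) / len(all_speeds)
```

This is why the percentile has to be pre-computed at each level where it will be reported, rather than derived from a lower-level aggregate.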
I have a data set with 20 classes, and it has a pretty non-uniform distribution. Is there any functionality in R that allows us to balance the data set (weighted perhaps)?
I want to use the balanced data with Weka for classification. Since my class distribution is skewed, I am hoping to get better results if there's no single majority class.
I have tried to use the SMOTE filter and Resample filter but they don't quite do what I want.
I don't want any instances to be removed; repetition is fine.
I think there's a misunderstanding in your terminology. Your question's title refers to sampling, and yet the question text involves weighting.
To clarify:
With sampling, you either have fewer, the same, or more instances than in the original set; the unique membership of a sample can be either a strict subset of the original set or can be identical to the original set (with replacement - i.e., duplicates).
By weighting, you simply adjust weights that may be used for some further purpose (e.g. sampling, machine learning) to address or impose some (im)balance relative to a uniform weighting.
I believe that you are referring to weighting, but the same answer should work in both cases. If the total number of observations is N and the frequency of each class is an element of the 20-long vector freq (e.g. the count of items in class 1 is freq[1]*N), then simply use a weight vector of 1/freq to normalize the weights. You can scale it by some constant, e.g. N, though it wouldn't matter. If any frequency is 0 or very close to it, you might address this by using a vector of smoothed counts (e.g. Good-Turing smoothing).
As a result, each set will have an equal proportion of the total weight.
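The 1/freq weighting can be sketched as follows, in Python for illustration (the question concerns R and Weka; the label data and the function name `inverse_frequency_weights` are hypothetical). Each instance receives the reciprocal of its class's relative frequency, and summing the weights per class shows that every class ends up with the same total.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Give each instance the weight 1 / (relative frequency of its class)."""
    n = len(labels)
    counts = Counter(labels)
    freq = {cls: k / n for cls, k in counts.items()}
    return [1.0 / freq[cls] for cls in labels]

# Hypothetical skewed labels: 6 of class a, 3 of class b, 1 of class c.
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
weights = inverse_frequency_weights(labels)

# Total weight per class; each comes out equal (up to rounding).
class_totals = {}
for label, w in zip(labels, weights):
    class_totals[label] = class_totals.get(label, 0.0) + w
```

No instance is removed; the imbalance is handled entirely through the weights, which can then be passed to a learner or a weighted resampler that accepts per-instance weights.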
I'd like to combine a few metrics of nodes in a social network graph into a single value for rank ordering the nodes:
in_degree + betweenness_centrality = informal_power_index
The problem is that in_degree and betweenness_centrality are measured on different scales, say 0-15 vs 0-35000, and follow a power-law distribution (or at least definitely not a normal distribution).
Is there a good way to rescale the variables so that one won't dominate the other in determining the informal_power_index?
Three obvious approaches are:
Standardizing the variables (subtract mean and divide by stddev). This seems it would squash the distribution too much, hiding the massive difference between a value in the long tail and one near the peak.
Re-scaling variables to the range [0,1] by subtracting min(variable) and dividing by max(variable) - min(variable). This seems closer to fixing the problem since it won't change the shape of the distribution, but maybe it won't really address the issue? In particular, the means will be different.
Equalize the means by dividing each value by mean(variable). This won't address the difference in scales, but perhaps the mean values are more important for the comparison?
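For comparison, the three rescalings can be sketched side by side in Python. The node values below are hypothetical, chosen only to mimic the 0-15 and 0-35000 ranges mentioned above, and the function names are mine.

```python
from statistics import mean, pstdev

def standardize(xs):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def min_max(xs):
    """Map to [0, 1]: subtract the minimum, divide by the range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def mean_scale(xs):
    """Divide by the mean so the rescaled variable has mean 1."""
    m = mean(xs)
    return [x / m for x in xs]

# Hypothetical long-tailed metrics for six nodes.
in_degree = [0, 1, 1, 2, 3, 15]
betweenness = [0, 10, 40, 200, 1500, 35000]

for rescale in (standardize, min_max, mean_scale):
    index = [a + b for a, b in zip(rescale(in_degree), rescale(betweenness))]
    print(rescale.__name__, [round(v, 3) for v in index])
```

All three are linear transformations, so none of them changes the shape of either distribution; they differ only in which summary (spread, range, or mean) is made equal across the two variables.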
Any other ideas?
You seem to have a strong sense of the underlying distributions. A natural rescaling is to replace each variate with its probability. Or, if your model is incomplete, choose a transformation that approximately achieves that. Failing that, here's a related approach: if you have a lot of univariate data from which to build a histogram (of each variate), you could convert each to a 10-point scale based on whether it falls in the 0-10% percentile band, the 10-20% band, ..., or the 90-100% band. These transformed variates have, by construction, a uniform distribution on 1, 2, ..., 10, and you can combine them however you wish.
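A minimal sketch of the 10-point conversion, in Python, using the empirical ranks of the data rather than a fitted model (the function name `decile_scores` and the sample values are hypothetical). Each value gets the score of the 10% band its rank falls in.

```python
from bisect import bisect_right

def decile_scores(xs):
    """Map each value to 1..10 according to its empirical percentile band."""
    ordered = sorted(xs)
    n = len(ordered)
    scores = []
    for x in xs:
        rank = bisect_right(ordered, x)       # number of values <= x, in 1..n
        scores.append(-(-10 * rank // n))     # integer ceil(10 * rank / n)
    return scores

# Hypothetical long-tailed betweenness values for 20 nodes.
betweenness = [k ** 3 for k in range(1, 21)]
scores = decile_scores(betweenness)
```

With distinct values, each score is used equally often. When there are many ties, as is common for in-degree, tied values share the score of the highest band they reach, so the result is only approximately uniform.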
You could translate each to a percentage and then apply each to a known quantity. Then use the sum of the new values.
((1 - (in_degree / 15)) * 2000) + ((1 - (betweenness_centrality / 35000)) * 2000) = ?
Very interesting question. Could something like this work:
Let's assume that we want to scale both variables to the range [-1,1].
Take the example of betweenness_centrality, which has a range of 0-35000.
Choose a large number on the order of the range of the variable. As an example, let's choose 25,000.
Create 25,000 bins in the original range [0,35000] and 25,000 bins in the new range [-1,1].
For each number x_i, find the bin it falls into in the original range. Let this be B_i.
Find the interval that bin B_i covers in the range [-1,1].
Use either the max or the min of that interval as the scaled version of x_i.
This preserves the power law distribution while also scaling it down to [-1,1] and does not have the problem as experienced by (x-mean)/sd.
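The binning steps above can be sketched as follows, in Python. The function name `rescale_by_bins` and its defaults are assumptions for illustration, and the sketch picks the lower edge (the min) of the matching bin. With equal-width bins on both sides, the result is a linear rescaling rounded down to the nearest bin edge.

```python
def rescale_by_bins(x, old_min=0.0, old_max=35_000.0, n_bins=25_000,
                    new_min=-1.0, new_max=1.0):
    """Map x from [old_min, old_max] to [new_min, new_max] via matching bins."""
    if not old_min <= x <= old_max:
        raise ValueError("x lies outside the original range")
    old_width = (old_max - old_min) / n_bins
    new_width = (new_max - new_min) / n_bins
    # Bin index of x in the original range; the top edge joins the last bin.
    bin_index = min(int((x - old_min) / old_width), n_bins - 1)
    # Lower edge of the same bin in the new range.
    return new_min + bin_index * new_width

# Hypothetical betweenness values.
scaled = [rescale_by_bins(v) for v in (0, 10, 1_000, 30_000, 35_000)]
```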
Normalizing to [0,1] would be my short-answer recommendation for combining the two values, as it will maintain the distribution shape, as you mentioned, and should solve the problem of combining the values.
If the distributions of the two variables differ, which sounds likely, this won't really give you what I think you're after, which is a combined measure of where each variable sits within its own distribution. You would have to come up with a metric that determines where in the given distribution the value lies. This could be done in many ways, one of which is to determine how many standard deviations from the mean the given value is. You could then combine these two values in some way to get your index (addition may no longer be sufficient).
You'd have to work out what makes the most sense for the data sets you're looking at. Standard deviations may well be meaningless for your application, but you need to look at statistical measures that relate to the distribution and combine those, rather than combining absolute values, normalized or not.