I have two situations:
Situation A:
0.4588
Situation B:
0.0021, 0.0128, 0.0072
As situation A has one value and situation B has three values, I want to generate a situation C with only a single value that represents / scales all four of these values.
I tried normalization of the data, but my condition is different (or so I think). Is there any way to scale / normalize this condition?
Thanks,
Waqas.
I think I have a rather simple problem, but I can't figure out the best approach. I have a vector with 30 different values. Now I need to divide the vector into 10 groups in such a way that the mean within-group variance is as small as possible. The size of the groups is not important; it can be anything between one and 21.
Example: let's say I have a vector of six values that I have to split into three groups:
Myvector <- c(0.88,0.79,0.78,0.62,0.60,0.58)
Obviously the solution would be:
Group1 <-c(0.88)
Group2 <-c(0.79,0.78)
Group3 <-c(0.62,0.60,0.58)
Is there a function that gives the same outcome as the example and that I can use for my vector with 30 values?
Many thanks in advance.
It sounds like you want to do k-means clustering. Something like this would work:
kmeans(Myvector, 3, algorithm = "Lloyd")
Note that I changed the default algorithm to match your desired output. If you read the ?kmeans help page, you will see that there are different algorithms for computing the clusters, because it's not a trivial computational problem; they might not necessarily guarantee optimality.
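As a small follow-up sketch (the fit variable and the seed are my own additions), the cluster assignments can be turned back into the groups from the example like this:
set.seed(42)                                  # kmeans picks random starting centers
fit <- kmeans(Myvector, centers = 3, algorithm = "Lloyd")
split(Myvector, fit$cluster)                  # the original values grouped by cluster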
My data consists of two raters interpreting one specific phenomenon to occur at different points in time. I have two questions:
1) What do I call these data? "Time-series data" seems too general and usually refers to metric data changing continuously over time (while I have just points along the timeline). Under "time-point data" I don't find problems of the kind described in question (2).
2) What indices for interrater reliability can I use - preferably in R? (If an index requires defining how much offset is tolerated, that could be 0.120 seconds.)
Example data (in seconds):
rater1:
181.23
181.566
181.986
182.784
183.204
191.352
193.956
195.426
197.568
197.82
198.576
202.02
205.8
206.136
208.53
209.034
216.216
220.08
220.584
230.706
238.266
238.518
239.442
241.5
241.836
244.398
rater2:
181.902
182.784
183.204
193.956
195.384
197.694
197.82
198.576
199.5
202.146
205.8
206.136
208.53
216.258
219.576
220.542
222.096
222.558
226.002
228.312
229.11
230.244
230.496
230.832
231.504
232.554
238.266
238.518
238.602
238.938
241.5
241.836
244.272
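Not a standard index, but as a rough sketch of what a tolerance-based agreement measure could look like in R (assuming rater1 and rater2 are numeric vectors holding the two lists above; the Dice-style proportion is my own choice, not an established coefficient):
agreement <- function(r1, r2, tol = 0.120) {
  # distance from each rater-1 event to the closest rater-2 event
  nearest <- sapply(r1, function(t) min(abs(r2 - t)))
  matched <- sum(nearest <= tol)
  # Dice-style proportion of matched events; note this matches from rater 1's
  # side only, a proper index would enforce one-to-one matching
  2 * matched / (length(r1) + length(r2))
}
agreement(rater1, rater2)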
I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between them. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0, 1]. The file is approximately 50 GB. The pairs of A, B whose score is 1 have been removed, as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with your data set's size. Plus, implementations usually need more than one copy of the matrix. You may need 1 TB of RAM then... 2 * 8 * 250000 * 250000 bytes is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
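Just to illustrate the single-link idea at toy scale (my own sketch, not something to run on the 50 GB file; there you would stream the sorted file and keep only the union-find array in memory), single-link from an edge list boils down to merging pairs in order of increasing score with union-find:
edges <- data.frame(a = c("A", "B", "A"), b = c("B", "C", "C"),
                    score = c(0.2, 0.5, 0.9))           # made-up example rows
nodes  <- union(edges$a, edges$b)
parent <- setNames(seq_along(nodes), nodes)             # union-find forest
find   <- function(i) { while (parent[i] != i) i <- parent[i]; i }
edges  <- edges[order(edges$score), ]                   # flip the order if the score is a similarity
for (k in seq_len(nrow(edges))) {
  ra <- find(match(edges$a[k], nodes))
  rb <- find(match(edges$b[k], nodes))
  if (ra != rb) parent[ra] <- rb                        # merge the two clusters
  # stop once the desired number of clusters (or a score cutoff) is reached
}
setNames(sapply(seq_along(nodes), find), nodes)         # cluster id per variable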
I have a fairly simple data set which I'm analyzing with glmmPQL(), then using glht() for Tukey comparisons of the groups. Group A has data with mostly numbers and a few zero values, Group B has all zeros except for one data value, and Group C is all zeros. (I'm doing this analysis on such a sparse data set to match the work done on a more robust data set.)
My glht()/Tukey results say that A and B are different, but not A and C. This makes no sense to me, as C is slightly further from A than B is.
Also, if I change group C by changing one of the zeros to a number (1), then glht()/Tukey sees the statistical difference between A and C, even though I actually moved the two groups closer together.
Does anyone understand why this happens?
I would like to unit test the time-writing software used at my company. In order to do this, I would like to create sets of random numbers that add up to a defined value.
I want to be able to control the parameters:
The min and max value of the generated numbers
The number n of generated numbers
The sum of the generated numbers
For example, in 250 days a person worked 2000 hours. The 2000 hours have to be randomly distributed over the 250 days. The maximum time spent per day is 9 hours and the minimum is 0.25.
I worked my way through this SO question and found the method
diff(c(0, sort(runif(249)), 2000))
This results in 1 big number and 249 small numbers. That's why I would like to be able to set a min and max for the generated numbers. But I don't know where to start.
You will have no problem meeting any two out of your three constraints, but all three might be a problem. As you note, the standard way to generate N random numbers that add to a sum is to generate N-1 random numbers in the range of 0..sum, sort them, and take the differences. This is basically treating your sum as a number line, choosing N-1 random points, and your numbers are the segments between the points.
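For instance, for the 250-day example in the question, that standard construction (with the cut points drawn over the full 0..2000 range rather than 0..1) would be:
set.seed(1)                          # just for reproducibility
cuts  <- sort(runif(249, 0, 2000))   # 249 random cut points on the 0..2000 line
hours <- diff(c(0, cuts, 2000))      # 250 segment lengths that sum to 2000
sum(hours)                           # 2000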
But this might not be compatible with constraints on the numbers themselves. For example, what if you want 10 numbers that add to 1000, but each has to be less than 100? That won't work. Even if you have ranges that are mathematically possible, forcing compliance with all the constraints might mean sacrificing uniformity or other desirable properties.
I suspect the only way to do this is to keep the sum and N constraints, do the standard N-1, sort, and diff thing, but restrict the resolution of the individual randoms to your desired minimum (in other words, instead of 0..100, maybe generate 0..10 and multiply by 10).
Or, instead of generating N-1 uniformly random points along the line, generate a random sample of points along the line within a similar low-resolution constraint.
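If the exact min/max limits matter more than the stick-breaking flavor, one low-resolution variant along these lines (my own sketch; random_hours and the slot-sampling trick are not from the question) is to work in quarter-hour units, give every day the minimum, and draw the leftover units without replacement from each day's remaining slots, which guarantees the sum, the minimum and the maximum by construction:
random_hours <- function(n = 250, total = 2000, min_h = 0.25, max_h = 9, unit = 0.25) {
  n_units   <- total / unit                       # 8000 quarter hours in total
  min_units <- min_h / unit                       # at least 1 unit per day
  max_units <- max_h / unit                       # at most 36 units per day
  extra     <- n_units - n * min_units
  stopifnot(extra >= 0, extra <= n * (max_units - min_units))
  # leftover units are drawn without replacement from the free "slots" each day
  # still has, so no day can end up above the maximum
  slots <- rep(seq_len(n), each = max_units - min_units)
  (min_units + tabulate(sample(slots, extra), nbins = n)) * unit
}
x <- random_hours()
c(sum(x), min(x), max(x))                         # 2000, at least 0.25, at most 9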