STDEV gives different results for 2 identical datasets in U-SQL

I have two CSV datasets, each with a single column 'v'. The data in these two files are exactly the same.
The column 'v' contains values that are converted to decimal? before computing the standard deviation.
SUM(v) and AVG(v) are the same for both datasets,
but the STDEV values do not match. How is that even possible?
Here is the code:
@ds1 =
    EXTRACT v decimal?
    FROM @ds1_path
    USING Extractors.Csv(skipFirstNRows: 1);

@ds2 =
    EXTRACT v decimal?
    FROM @ds2_path
    USING Extractors.Csv(skipFirstNRows: 1);

@data =
    SELECT STDEV(v) AS stdev,
           SUM(v) AS sum,
           AVG(v) AS avg,
           VAR(v) AS vari,
           "ds1" AS type
    FROM @ds1
    UNION ALL
    SELECT STDEV(v) AS stdev,
           SUM(v) AS sum,
           AVG(v) AS avg,
           VAR(v) AS vari,
           "ds2" AS type
    FROM @ds2;
This gives the following output. Notice that the SUM and AVG values are exactly the same, but the VAR and STDEV values do not match.
Can somebody please help?
[output screenshot]

Related

Median Values in R - Returns Rounded Number

I have a table of data where I've labeled the rows based on the cluster they fall into and calculated the average of each row's column values. I would like to select the median row for each cluster.
For example's sake, looking at just one cluster, I would like to use:
median(as.numeric(as.vector(subset(df,df$cluster == i )$avg)))
I can see that
> as.numeric(as.vector(subset(df,df$cluster == i )$avg))
[1] 48.11111111 47.77777778 49.44444444 49.33333333 47.55555556 46.55555556 47.44444444 47.11111111 45.66666667 45.44444444
And yet, the median is
> median(as.numeric(as.vector(subset(df,df$cluster == i )$avg)))
[1] 47.5
I would like to find the median record by matching the returned median against the averages in that column, but that isn't possible with this return value.
I've found some documentation and questions about rounding with the mean function, but that doesn't seem to apply here, unfortunately.
I could also limit the number of decimal places in the data, but some records are so close that duplicates would be common if rounded to one decimal place.
When the input has an even number of values (like the 10 values you have) then there is not a value directly in the middle. The standard definition of a median (which R implements) averages the two middle values in the case of an even number of inputs. You could rank the data, and in the case of an even-length input select either the n/2 or n/2 + 1 record.
So, if your data was x = c(8, 6, 7, 5), the median is 6.5. You seem to want the index of "the median", that is either 2 or 3.
If we assume there are no ties, then we can get these answers with
which(rank(x) == length(x) / 2)
# [1] 2
which(rank(x) == length(x) / 2 + 1)
# [1] 3
If ties are a possibility, then rank's default tie-breaking method will cause you some problems. Have a look at ?rank and figure out which option you'd like to use.
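For instance, with a small vector containing a tie, the default average ranks are fractional, so an equality test against length(x)/2 can fail:
rank(c(2, 1, 2))
# [1] 2.5 1.0 2.5
rank(c(2, 1, 2), ties.method = "first")
# [1] 2 1 3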
We can, of course, turn this into a little utility function:
median_index = function(x) {
  lx = length(x)
  if (lx %% 2 == 1) {
    return(match(median(x), x))
  }
  which(rank(x, ties.method = "first") == lx/2 + 1)
}
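For instance:
median_index(c(8, 6, 7, 5))
# [1] 3
median_index(c(5, 9, 1))
# [1] 1
In the even case this returns the index of the upper of the two middle values (here 7); in the odd case it is the index of the actual median (here 5).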
There is an easier way to do that: use dplyr
library(dplyr)
df %>%
  group_by(cluster) %>%
  summarise(Median = median(avg))

Get location of row with median value in R data frame

I am a bit stuck with this basic problem, but I cannot find a solution.
I have two data frames (dummies below):
x<- data.frame("Col1"=c(1,2,3,4), "Col2"=c(3,3,6,3))
y<- data.frame("ColA"=c(0,0,9,4), "ColB"=c(5,3,20,3))
I need to use the location of the median value of one column in df x to retrieve a value from df y. For this, I am trying to get the row number of the median value in e.g. x$Col1, and then retrieve the corresponding value using something like y[,"ColB"][row.number]
Is there an elegant way/function for doing this? Solutions might need to account for two cases: when the sample has an even number of values and when it has an odd number (with an even count, the median may be a value not found in the sample, since it is the mean of the two middle values).
The problem is a little underspecified.
What should happen when the median isn't in the data?
What should happen if the median appears in the data multiple times?
Here's a solution which takes the (absolute) difference between each value and the median, then returns the index of the first row for which that difference vector achieves its minimum.
with(x, which.min(abs(Col1 - median(Col1))))
# [1] 2
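Applied to the retrieval step described in the question, that row number can then be used to index y:
y$ColB[with(x, which.min(abs(Col1 - median(Col1))))]
# [1] 3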
The quantile function with type = 1 (i.e. no averaging) may also be of interest, depending on your desired behavior. It returns the lower of the two "sides" of the median, while the which.min method above can depend on the ordering of your data.
quantile(x$Col1, .5, type = 1)
# 50%
# 2
An option using quantile is
with(x, which(Col1 == quantile(Col1, .5, type = 1)))
# [1] 2
This could possibly return multiple row-numbers.
Edit:
If you want it to only return the first match, you could modify it as shown below
with(x, which.min(Col1 != quantile(Col1, .5, type = 1)))
Here, something like y$ColB[which(x$Col1 == round(median(x$Col1)))] would do the trick.
The problem is that x has an even number of rows, so the median 2.5 is not an integer. In this case you have to choose between rows 2 and 3.
Note: The above works for your example, not for general cases (e.g. c(-2L,2L) or with rational numbers). For the more general case see #IceCreamToucan's solution.

Aggregation and typing inconsistency in `data.table`

I have a question related to Why does median trip up data.table (integer versus double)?, except in my case I am using a maximum. I am excluding missing values. In base R, the max of a length-0 vector is -Inf which, interestingly, is a double and not an integer. I think there may be a bug in data.table's recent optimization routines.
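For reference, base R itself shows the type switch described above (a quick illustration, independent of data.table):
max(integer(0), na.rm = TRUE)  # -Inf, with a warning; typeof() says "double"
max(4:6)                       # 6L; typeof() says "integer"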
Take this data table:
dt <- data.table(id = c(1,1,1,3,3,3), num = 1:6, log = c(F,F,F,T,F,T))
If we perform:
dt[, .(mnum = max(num[log], na.rm=T)), by=id]
We find the error:
Error in `[.data.table`(dt, , .(mnum = max(num[log], na.rm=T)), by=id) :
  Column 1 of result for group 2 is type 'integer' but expecting type 'double'. Column types must be consistent for each group.
Am I correct in thinking this is a bug or is there a syntactic omission here?
The expected output would, of course, be:
mnum id
-Inf  1
   6  3
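One way to sidestep the mismatch (a sketch only; it does not settle whether the behaviour is a bug) is to coerce the per-group result to double so that every group returns the same type:
dt[, .(mnum = as.double(max(num[log], na.rm = TRUE))), by = id]
# id = 1 gives -Inf, id = 3 gives 6 (plus a warning from max() on the empty group)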

How can I find average, sum and count of values of a single column in Pig?

I have a variable car_age which holds the distinct values of the car's age from the entire CSV file. How can I take the average of all the values? I need to replace the outliers with the average (or mean) of the car_age values.
Here is what I am doing currently.
training_data = LOAD '/user/All_State_Insurance_Prediction_Dataset/sampled_training_dataset/sampled_training_set';
A1 = FOREACH training_data GENERATE car_age;
B1 = DISTINCT A1;
B1 holds the distinct values of car_age. How can I find the average, sum and count of the values in B1? I didn't use GROUP BY because I need those operations to be done on a single list of values.
Try this for computing the average:
training_data = LOAD '/user/All_State_Insurance_Prediction_Dataset/sampled_training_dataset/sampled_training_set' USING PigStorage();
A1 = FOREACH training_data GENERATE car_age;
B1 = DISTINCT A1;
B1_grouped = GROUP B1 all;
B1_avg = FOREACH B1_grouped GENERATE AVG(B1);
You can do the same for SUM and other aggregate functions; a sketch is shown below.
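For example, a sketch extending the same pattern to SUM and COUNT (this assumes the LOAD statement declares a schema so that car_age is a typed numeric field, e.g. ... AS (car_age:int)):
-- assumes B1 was built as above and car_age is numeric in the LOAD schema
B1_grouped = GROUP B1 ALL;
B1_stats = FOREACH B1_grouped GENERATE
               AVG(B1.car_age)  AS avg_car_age,
               SUM(B1.car_age)  AS sum_car_age,
               COUNT(B1)        AS count_car_age;
DUMP B1_stats;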

Mathematica - Extract actual data values to calculate expected value

I have the data below in variable X. The data is in the form of pairs of numbers {a, b}.
a represents the actual value while b represents its frequency in the data set.
X = {{20, 30}, {21, 40}, {22, 50}}
I want to calculate the expected value of this data set.
How can I extract all the values of a into a separate data set?
The expected value is (in non-Mma notation) sum(x[i]*p[i], i, 1, n) where x[i] is the i-th distinct value (i.e., first value in each pair), p[i] is the proportion of that value (i.e., second value in each pair divided by the total of all of the second values), and n is the number of distinct values of x (i.e., the number of pairs). I think this is enough to help you solve it now.
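Applied to the sample data above, that works out to:
expected value = (20*30 + 21*40 + 22*50) / (30 + 40 + 50) = 2540 / 120 ≈ 21.17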
