Grouping in R changes mean substantially

I have a file containing the predictions of two models (A and B) on a binary classification problem. I'd like to understand how good they are at predicting the observations they are most confident about. To do that, I want to group their predictions into 10 groups based on how confident they are, with each group containing an identical number of observations. However, when I do that, the accuracy of the models changes substantially! How can that be?
I've also tested with n_groups=100, but it only makes a minor difference. The CSV file is here and the code is below:
# Grouping observations
conf <- read.table(file="conf.csv", sep=',', header=T)
n_groups <- 10
conf$model_a_conf <- pmax(conf$model_a_pred_0, conf$model_a_pred_1)
conf$model_b_conf <- pmax(conf$model_b_pred_0, conf$model_b_pred_1)
conf$conf_group_model_a <- cut(conf$model_a_conf, n_groups, labels=FALSE, ordered_result=TRUE)
conf$conf_group_model_b <- cut(conf$model_b_conf, n_groups, labels=FALSE, ordered_result=TRUE)
# Test of original mean.
mean(conf$model_a_acc) # 0.78
mean(conf$model_b_acc) # 0.777
# Test for mean in aggregated data. They should be similar.
(acc_model_a <- mean(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN=mean))) # 0.8491
(acc_model_b <- mean(tapply(conf$model_b_acc, conf$conf_group_model_b, FUN=mean))) # 0.7526

table(conf$conf_group_model_a)
   1    2    3    4    5    6    7    8    9   10
2515 2628 2471 2128 1792 1321  980  627  398  140
The groups you are using are unbalanced. Taking the mean of each of those groups with tapply is fine; however, simply taking the unweighted mean of those group means afterwards is not the way to go. You need to weight each group's mean by its size if you want to follow this process. Something quick and dirty:
sum(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN=mean) *
      table(conf$conf_group_model_a) / nrow(conf))  # weighted by group share; recovers ~0.78
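Note also that cut() with a single number splits the range into intervals of equal width, not groups of equal size, which is why the counts above are so unbalanced. If you really want groups with (approximately) the same number of observations, as the question describes, you can cut at the empirical quantiles instead; a minimal sketch (ties in the confidence values can still leave the groups slightly unequal):
breaks <- quantile(conf$model_a_conf, probs=seq(0, 1, length.out=n_groups + 1))
conf$conf_group_model_a <- cut(conf$model_a_conf, breaks=unique(breaks),
                               labels=FALSE, include.lowest=TRUE)
table(conf$conf_group_model_a)  # counts are now roughly equal across groups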

Related

R: adjusting a given time-series but keeping summary statistics equal

Let's say I have a time-series like this
t x
1 100
2 50
3 200
4 210
5 90
6 80
7 300
Is it possible in R to generate a new dataset x1 which has the exact same summary statistics (mean, variance, skew, kurtosis) as x?
The reason for my asking is that I would like to do an experiment where I want to test how subjects react to different graphs of data that contain the same information.
I recently read:
Matejka, Justin, and George Fitzmaurice. "Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing." Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2017.
Chatterjee, Sangit, and Aykut Firat. "Generating data with identical statistics but dissimilar graphics: A follow up to the Anscombe dataset." The American Statistician, 2007.
However, Matejka and Fitzmaurice use Python code that is quite involved, and their data are more complex than time-series data, so I was wondering whether there is a more efficient way to do this for a simpler dataset?
Best regards
I'm not aware of a package that can give you precisely what you are looking for. One option is using the datasets in the datasauRus package as JasonAizkalns pointed out. However, if you want to create your own dataset, you can try this:
Fit the Johnson distribution from the SuppDists package to capture the moments of the dataset, then draw new sets from that distribution until the difference is sufficiently small. Below is an example with your dataset, although more observations make it easier to replicate the summary statistics:
library(SuppDists)
a <- c(100,50,200,210,90,80,300)
momentsDiffer <- function(x1, x2){
  diff <- sum(abs(moments(x1) - moments(x2)))
  return(diff)
}
repDataset <- function(x, n){
  # fit Johnson distribution to the target data
  parms <- JohnsonFit(x, moment="quant")
  # generate from the distribution n times, keeping the best draw so far
  current <- rJohnson(length(x), parms)
  momDiff <- momentsDiffer(x, current)
  for(i in 1:n){
    temp <- rJohnson(length(x), parms)
    tempDiff <- momentsDiffer(x, temp)
    if(tempDiff < momDiff){
      current <- temp
      momDiff <- tempDiff
    }
  }
  return(current)
}
# Drawing 1000 times to allow improvement
b <- repDataset(a,1000)
> moments(b)
mean sigma skew kurt
148.14048691 84.24884165 1.04201116 -0.05008629
> moments(a)
mean sigma skew kurt
147.1428571 84.1281821 0.5894543 -1.0198303
EDIT - Added additional method
Following the suggestion of @Jj Blevins, the method below generates a random sequence based upon the original sequence, leaving out 4 observations. Those 4 observations are then added back by solving a non-linear system of equations on the difference between the four moments of the original sequence and the new sequence. This still does not generate a perfect match; feel free to improve it.
library(SuppDists)  # for moments()
library(nleqslv)
library(e1071)
set.seed(1)
a <- c(100,50,200,210,90,80,300)
#a <- floor(runif(1000,0,101))  # the output below was produced with this larger dataset
init <- floor(runif(length(a)-4, min(a), max(a)+1))
targetMoments <- moments(a)  # c(mean, sigma, skew, kurt); sigma is a standard deviation
f <- function(x) {
  m1 <- mean(c(init, x))
  m2 <- sd(c(init, x))        # moments() and e1071 use slightly different
  m3 <- skewness(c(init, x))  # conventions (e.g. n vs. n-1 denominators),
  m4 <- kurtosis(c(init, x))  # so the match can only be approximate
  c(m1 - targetMoments[1], m2 - targetMoments[2],
    m3 - targetMoments[3], m4 - targetMoments[4])
}
result <- nleqslv(runif(4, min(a), max(a)+1), f,
                  control=list(ftol=.00000001, allowSingular=TRUE))
> moments(c(init,result$x))
mean sigma skew kurt
49.12747961 29.85435993 0.03327868 -1.25408078
> moments(a)
mean sigma skew kurt
49.96600000 29.10805462 0.03904256 -1.18250616
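As an aside, if only the first two moments need to match exactly, an affine transform guarantees that much, since the mean and standard deviation transform predictably under shifting and scaling; skew and kurtosis still come from the generating distribution. A minimal sketch (match_mean_sd is a hypothetical helper, not part of SuppDists):
match_mean_sd <- function(x_new, target) {
  z <- (x_new - mean(x_new)) / sd(x_new)  # standardise to mean 0, sd 1
  z * sd(target) + mean(target)           # rescale to the target mean and sd
}
b2 <- match_mean_sd(rJohnson(length(a), JohnsonFit(a, moment="quant")), a)
c(mean(b2), sd(b2))  # equal to mean(a) and sd(a) up to floating-point error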

predict.lm after regression with missing data in Y

I don't understand how to generate predicted values from a linear regression using the predict.lm command when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically, this isn't a problem, but I don't know an efficient way to do it in R. Take for example this fake dataframe and regression model. I attempt to assign the predictions to a column of the source dataframe, but because of the one missing Y value I get an error.
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip <- predict.lm(model)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions with algebra, df$y <- B0 + B1*df$x, or by pulling the coefficients out of the model, df$y <- summary(model)$coefficients[1, 1] + summary(model)$coefficients[2, 1]*df$x; however, I am now working with a big model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
There is built-in functionality for this in R, though it's not necessarily obvious: the na.action argument (see ?na.exclude). With this option set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit model: default na.action is na.omit, which simply removes non-complete cases.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
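Since the na.exclude predictions line up with the original rows, the assignment the asker wanted works directly:
df$y_ip <- predict(mod2)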
Actually, you are not using the predict.lm function correctly. Either way, you have to pass the model itself as the first argument, here model, with or without new data. Without new data, it only predicts on the training data, thus excluding your NA row, and you need this workaround to fit the original data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or you can explicitly specify new data. Since the new x has one more row than the training x, it will fill the missing row with a new prediction:
df$y_ip <- predict.lm(model, newdata = df)

Cluster assignment after estimate in a large dataset (Mclust)

I've been doing a clustering analysis on a relatively large dataset (~50,000 observations and 16 variables).
library(mclust)
load(file="mdper.f.Rdata")#mdper.f = My stored data
As my computer was unable to handle the full dataset, I worked with subsets (10 subsets of 5,000 observations; the example below uses 16,000 and took about 15 minutes to compute) and used Mclust to determine the optimal number of groups.
ind <- sample(1:nrow(mdper.f), size=16000)  # sample of 16,000 (~15 min of computing)
nfac <- mdper.f[ind,]                       # the sampled subset
Fnac <- scale(nfac)                         # scale the data
mod = Mclust(Fnac) #Determining the optimal number of clusters
summary(mod) #Summary
#Results:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VII (spherical, varying volume) model with 9 components:
log.likelihood n df BIC ICL
128118.2 16000 80 255462 254905.3
Clustering table:
   1    2    3    4    5    6    7    8    9
1879 2505 3452 3117 2846  464  822  590  325
The result is always 9 components (10 out of 10 subsets of 5,000), so I guess that's okay.
Now I would like to assign the rest of the data to the estimated clusters, based on the multidimensional structure of the model.
How can I do it?
I started playing with the Mclust object, but I can't see how to use it to classify the rest of the data.
The optimal solution would be my original data with an extra column with the cluster number (1 to 9) assigned, for example.
I got the answer after a few minutes of work.
First of all, there was a conceptual mistake: the dataset must be scaled before partitioning; after that it's just a matter of using predict():
library(mclust)
load(file="mdper.f.Rdata")                    # mdper.f = my stored data
mdper.f.s <- scale(mdper.f)                   # scale the full dataset first
ind <- sample(1:nrow(mdper.f.s), size=16000)  # sample of 16,000
nfac <- mdper.f.s[ind,]                       # the sampled subset
mod16 <- Mclust(nfac)                         # determine the optimal number of clusters (~15 min)
prediction <- predict(mod16, mdper.f.s)       # predict with the fitted model on all scaled data
mdper.f <- cbind(mdper.f, prediction$classification)  # attach assignments to the original data
colnames(mdper.f)[ncol(mdper.f)] <- "Cluster" # name the new column
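One subtlety worth noting if genuinely new observations arrive later: they must be scaled with the same centering and scaling used for the training data, not with a fresh call to scale(). A minimal sketch (new.data is a hypothetical new batch of rows with the same 16 variables):
ctr <- attr(mdper.f.s, "scaled:center")
scl <- attr(mdper.f.s, "scaled:scale")
new.s <- scale(new.data, center=ctr, scale=scl)  # reuse the training centering/scaling
predict(mod16, new.s)$classification             # cluster assignments for the new rows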

Applying ezANOVA error work-around to Long Format data

I have a similar problem as described here:
https://stats.stackexchange.com/questions/58435/repeated-measures-error-in-r-ezanova-using-more-levels-than-subjects-balanced-d
Here is an example of what my dataframe looks like:
Participant  Visual         Audio                StimCondition  Accuracy
1            Bottom Circle  1st 2 Central Beeps  AO2            0.92
1            SIM Circle     Left Beep            AO2            0.86
2            Bottom Circle  1st 2 Central Beeps  CT4            0.12
2            SIM Circle     Left Beep            CT4            0.56
I have 3 Visual conditions, 5 Audio conditions & 5 StimConditions & 12 participants exposed to all conditions.
When I run the following ezANOVA:
Model <- ezANOVA(data = Shaped.means, dv = .(Accuracy), wid = .(Participant), within = .(Visual, Audio, StimCondition), type = 3, detailed = TRUE)
I get the same error as in the linked question above. I have tried changing type to 1, and it does return output, but without the sphericity test.
I've tried to apply the solution from the linked question to my dataset, but since mine is in long format, I'm a bit lost as to what exactly I need to do to get the desired statistics.
I'll keep playing with it on my end, but if anyone could help in the meantime it would be much appreciated.
Thanks.
Following the linked question, you don't have to change much. Assuming your dataset is exactly as you describe, the following should work for you.
Let's first create a dataset to reflect your description
set.seed(123) ## make reproducible
N <- 12 ## number of Participants
S <- 5 ## number of StimCondition groups
V <- 3 ## number of Visual groups
A <- 5 ## number of Audio groups
Accuracy <- abs(round(runif(N*V*S*A), 2)) ## N*V*A*S random accuracy values in [0, 1]
init.Df <- expand.grid(Participant=gl(N, 1),
                       Visual=gl(V, 1),
                       Audio=gl(A, 1),
                       StimCondition=gl(S, 1))
df <- cbind(init.Df, Accuracy)
Now we have a dataframe with 3 Visual conditions, 5 Audio conditions & 5 StimConditions & 12 participants exposed to all conditions. This should be at the stage you are currently at. We can do the between-subjects call easily.
# If you just read in the data set and don't know how many subjects
# N <- length(unique(df$Participant))
fit <- lm(matrix(df[,c("Accuracy")], nrow=N) ~ 1)
For the factor component, this is the only real change. If you simply generate your model design, you can pass it to anova.
library(car)
# You can create your within design table
# You can get these values from your dataset as well
# V <- nlevels(df$Visual)
# A <- nlevels(df$Audio)
# S <- nlevels(df$StimCondition)
# If you want the labels with gl, you can use the levels function (e.g. labels=levels(df$Visual))
inDf <- expand.grid(Visual=gl(V, 1),
                    Audio=gl(A, 1),
                    StimCondition=gl(S, 1))
# Test for Visual
anova(fit, M=~Visual, X=~1, idata=inDf, test="Spherical")
# Test for Audio
anova(fit, M=~Visual+Audio, X=~Visual, idata=inDf, test="Spherical")
# Test for Visual:Audio interaction
anova(fit, M=~Visual+Audio+Visual:Audio, X=~Visual+Audio, idata=inDf, test="Spherical")
#etc...
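Presumably the remaining effects follow the same pattern: each call's M formula adds the term being tested, and X is the previous M. For example, a sketch of the StimCondition test (an assumption extending the pattern above, not verified against your data):
# Test for StimCondition
anova(fit, M=~Visual+Audio+Visual:Audio+StimCondition,
      X=~Visual+Audio+Visual:Audio, idata=inDf, test="Spherical")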

Simulating a t-test in R

I am looking for a way to simulate the power of a simple t-test in different sample sizes. My idea is to generate 400 random normal distribution samples, each with mean 5 and variance 1, and perform a t-test on each one concerning the hypothesis that the true mean is 4, i.e. the t-test would take the form:
t=(mean(x)-4)*sqrt(n)/sd(x) # for each sample x which consists of n observations.
For comparison, I would like the first 100 samples to consist of 10 observations, the next 100 of 100, the next 100 of 1000, and finally the last 100 of 5000, which I think is the upper limit. A t-test will have to be performed on each and every sample.
Lastly, I would like to see for what percentage of each sample group (call them n10, n100, n1000, n5000, after how many observations they comprise) my (false) hypothesis is rejected.
Could you please help me write the corresponding R-code? I know the small commands but have trouble putting it all together. This is a nice exercise and hopefully I shall then be able to modify it a bit and use it for different purposes as well.
Thank you in advance.
Here's a one-liner for 400 t-tests with n=10:
R> simulations <- replicate(400, t.test(rnorm(10, mean=5, sd=1), mu=4),
                            simplify=FALSE)
Then you can analyze it:
R> table(sapply(simulations, "[[", "p.value") < .05)
FALSE TRUE
75 325
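Extending that one-liner to the full design in the question (100 samples at each of n = 10, 100, 1000, 5000, drawn with mean 5 and tested against mu = 4), a minimal sketch:
sizes <- c(10, 100, 1000, 5000)
power <- sapply(sizes, function(n) {
  pvals <- replicate(100, t.test(rnorm(n, mean=5, sd=1), mu=4)$p.value)
  mean(pvals < .05)  # proportion of samples in which H0: mu = 4 is rejected
})
setNames(power, paste0("n", sizes))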
I'm still learning R, too, so handle with care:
n <- 5
N <- 100
samplesizes <- as.integer(10^(1:n))  # 10, 100, ..., 100000 (not exactly the question's sizes)
set.seed(1)
# generate N samples at each size; here the true mean is 4 and mu=5 is tested,
# which is equivalent in power terms to the question's mean 5 vs. mu=4
samples <- replicate(N, mapply(rnorm, samplesizes, mean=4, sd=sqrt(1)))
# perform t-tests
t.tests <- lapply(samples, function(x) t.test(x, mu=5, alternative="two.sided"))
# get p-values
t.test.pvalues <- sapply(t.tests, function(x) x$p.value)
# p > .05 means the (false) null is NOT rejected; power is 1 minus these rates
notRejected <- t.test.pvalues > .05
sampleIndices <- rep(1:n, N)
res <- aggregate(notRejected, list(sample=sampleIndices),
                 FUN=function(x) sum(x)/length(x))
names(res)[2] <- "percNotRejected"
print(res, row.names=F)
# sample percNotRejected
#      1            0.16
#      2            0.00
#      3            0.00
#      4            0.00
#      5            0.00
