Random sampling with normal distribution from existing data in R - r

I have a large dataset of individuals who have rated some items (x1:x10). For each individual. the ratings have been combined into a total score (ranging 0-5). Now, I would like to draw two subsamples with the same sample size, in which the total score has a specific mean (1.5 and 3) and follows a normal distribution. Individuals might be part of both subsamples.
A guess to solve this, sampling with the outlined specifications from a vector (the total score) will work. Unfortunately, I have found only found different ways to draw random samples from a vector, but not a way to sample around a specific mean.
EDIT:
As pointed out I normal distribution would not be possible. Rather than I am looking way to sample a binomial distribution (directly from the data, without the work around of creating a similar distribution and matching).

You can't have normally distributed data on a discrete scale with hard limits. A sample drawn from a normal distribution with a mean between 0 and 5 would be symmetrical around the mean, would take coninuous rather than discrete values and would have a non-zero probability of containing values of less than zero and more than 5.
You want your sample to contain discrete values between zero and five and to have a central tendency around the mean. To emulate scores with a particular mean you need to sample from the binomial distribution using rbinom.
get_n_samples_averaging_m <- function(n, m)
{
rbinom(n, 5, m/5)
}
Now you can do
samp <- get_n_samples_averaging_m(40, 1.5)
print(samp)
# [1] 1 3 2 1 3 2 2 1 1 1 1 2 0 3 0 0 2 2 2 3 1 1 1 1 1 2 1 2 0 1 4 2 0 2 1 3 2 0 2 1
mean(samp)
# [1] 1.5

Related

Bayesian Question: Exponential Prior and Poisson Likelihood: Posterior?

I am needing assistance in a particular question and need confirmation of my understanding.
The belief is that absences in a company follow
a Poisson(λ) distribution.
It is believed additionally that 75% of thes value of λ is less than 5 therefore it is decided that a exponential distribution will be prior for λ. You take a random sample of 50 students and find out the number of absences that each has had over the past semester.
The data summarised below, note than 0 and 1 are binned collectively.
Number of absences
≤ 1 2 3 4 5 6 7 8 9 10
Frequency
18 13 8 3 4 3 0 0 0 1
Therefore in order to calculate a posterior distribution, My understanding is that prior x Likelihood which is this case is a Exponential(1/2.56) and a Poisson with the belief incorporated that the probability of less than 5 is 0.75 which is solved using
-ln(1-0.75)/(1/2.56)= 3.5489.
Furthermore a similar thread has calculated the Posterior to be that of a Gamma (sum(xi)+1,n+lambda)
Therefore with those assumptions, I have some code to visualise this
x=seq(from=0, to=10, by= 1)
plot(x,dexp(x,rate = 0.390625),type="l",col="red")
lines(x,dpois(x,3.54890),col="blue")
lines(x,dgamma(x,128+1,50+3.54890),col="green")
Any help or clarification surround this would be greatly appreciated

two phase sample weights

I am using survey package to analyze data from a two-phase survey. The first phase included ~5000 cases, and the second ~2700. Weights were calculated beforehand to adjust for several variables (ethnicity, sex, etc.) AND(as far as I understand) for the decrease in sample size when performing phase 2.
I'm interested in proportions of binary variables, e.g. suicides in sick vs. healthy individuals.
An example of a simple output I receive in the overall sample:
table (df$schiz,df$suicide)
0 1
0 4857 8
1 24 0
An example of a simple output I receive in the second phase sample only:
table (df2$schiz,df2$suicide)
0 1
0 2685 5
1 24 0
And within the phase two sample with the weights included:
dfw<-svydesign(ids=~1,data=df2, weights=df2$weights)
svytable (~schiz+suicide, design=dfW)
suicide
schiz 0 1
0 2701.51 2.67
1 18.93 0.00
My question is: Shouldn't the weights correct for the decrease in N when moving from phase 1 to phase 2? i.e. shouldn't the total N of the second table after correction be ~5000 cases, and not ~2700 as it is now?

Holt Winters predict results differ from fitted data dramatically

I am having a huge difference between my fitted data with HoltWinters and the predict data. I can understand there being a huge difference after several predictions but shouldn't the first prediction be the same number as the fitted data would be if it had one more number in the data set??
Please correct me if I'm wrong and why that wouldn't be the case?
Here is an example of the actual data.
1
1
1
2
1
1
-1
1
2
2
2
1
2
1
2
1
1
2
1
2
2
1
1
2
2
2
2
2
1
2
2
2
-1
1
Here is an example of the fitted data.
1.84401709401709
0.760477897417666
1.76593566042741
0.85435674207981
0.978449891674328
2.01079668445307
-0.709049507055536
1.39603638693742
2.42620183925688
2.42819282543689
2.40391946256294
1.29795840410863
2.39684770489517
1.35370435531208
2.38165200319969
1.34590347535205
1.38878761417551
2.36316132796798
1.2226736501825
2.2344269563083
2.24742853293732
1.12409156568888
Here is my R code.
randVal <- read.table("~/Documents/workspace/Roulette/play/randVal.txt", sep = "")
test<-ts(randVal$V1, start=c(1,133), freq=12)
test <- fitted(HoltWinters(test))
test.predict<-predict(HoltWinters(test), n.ahead=1*1)
Here is the predicted data after I expand it to n.ahead=1*12. Keep in mind that I only really want the first value. I don't understand why all the predict data is so low and close to 0 and -1 while the fitted data is far more accurate to the actual data..... Thank you.
0.16860570380268
-0.624454483845195
0.388808753990824
-0.614404235175936
0.285645402877705
-0.746997659036848
-0.736666618626855
0.174830187188718
-1.30499945596422
-0.320145850774167
-0.0917166719596059
-0.63970713627854
Sounds like you need a statistical consultation since the code is not throwing any errors. And you don't explain why you are dissatisfied with the results since the first value for those two calls is the same. With that in mind, you should realize that most time-series methods assume input from de-trended and de-meaned data, so they will return estimated parameters and values that would need in many cases to be offset to the global mean to predict on the original scale. (It's really bad practice to overwrite intermediate values as you are doing with 'test'.) Nonetheless, if you look at the test-object you see a column of yhat values that are on the scale of the input data.
Your question "I don't understand why all the predict data is so low and close to 0 and -1 while the fitted data is far more accurate to the actual data" doesn't say in what sense you think the "predict data[sic]" is "more accurate" than the actual data. The predict-results is some sort of estimate and they are split into components, as you would see if you ran the code on help page for predict:
plot(test,test.predict12)
It is not "data", either. It's not at all clear how it could be "more accurate" unless you has some sort of gold standard that you are not telling us about.

R: how to estimate a fixed effects model with weights

I would like to run a fixed-effects model using OLS with weighted data.
Since there can be some confusion, I mean to say that I used "fixed effects" here in the sense that economists usually imply, i.e. a "within model", or in other words individual-specific effects. What I actually have is "multilevel" data, i.e. observations of individuals, and I would like to control for their region of origin (and have corresponding clustered standard errors).
Sample data:
library(multilevel)
data(bhr2000)
weight <- runif(length(bhr2000$GRP),min=1,max=10)
bhr2000 <- data.frame(bhr2000,weight)
head(bhr2000)
GRP AF06 AF07 AP12 AP17 AP33 AP34 AS14 AS15 AS16 AS17 AS28 HRS RELIG weight
1 1 2 2 2 4 3 3 3 3 5 5 3 12 2 6.647987
2 1 3 3 3 1 4 3 3 4 3 3 3 11 1 6.851675
3 1 4 4 4 4 3 4 4 4 2 3 4 12 3 8.202567
4 1 3 4 4 4 3 3 3 3 3 3 4 9 3 1.872407
5 1 3 4 4 4 4 4 3 4 2 4 4 9 3 4.526455
6 1 3 3 3 3 4 4 3 3 3 3 4 8 1 8.236978
The kind of model I would like to estimate is:
AF06_ij = beta_0 + beta_1 AP34_ij + alpha_1 * (GRP == 1) + alpha_2 * (GRP==2) +... + e_ij
where i refer to specific indidividuals and j refer to the group they belong to.
Moreover, I would like observations to be weighted by weight (sampling weights).
However, I would like to get "clustered standard errors", to reflect possible GRP-specific heteroskedasticity. In other words, E(e_ij)=0 but Var(e_ij)=sigma_j^2 where the sigma_j can be different for each GRP j.
If I understood correctly, nlme and lme4 can only estimate random-effects models (or so-called mixed models), but not fixed-effects model in the sense of within.
I tried the package plm, which looked ideal for what I wanted to do, but it does not allow for weights. Any other idea?
I think this is more of a stack exchange question, but aside from fixed effects with model weights; you shouldn't be using OLS for an ordered categorical response variable. This is an ordered logistic modeling type of analysis. So below I use the data you have provided to fit one of those.
Just to be clear we have an ordered categorical response "AF06" and two predictors. The first one "AP34" is also an ordered categorical variable; the second one "GRP" is your fixed effect. So generally you can create a group fixed effect by coercing the variable in question to a factor on the RHS...(I'm really trying to stay away from statistical theory because this isn't the place for it. So I might be inaccurate in some of the things I'm saying)
The code below fits an ordered logistic model using the polr (proportional odds logistic regression) function. I've tried to interpret what you were going for in terms of model specification, but at the end of the day OLS is not the right way forward. The call to coefplot will have a very crowded y axis I just wanted to present a very rudimentary start at how you might interpret this. I'd try to visualize this in a more refined way for sure. And back to interpretation...You will need to work on that, but I think this is generally the right method. The best resource I can think of is chapters 5 and 6 of "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Gelman and Hill. It's such a good resource so I'd really recommend you read the whole thing and try to master it if you're interested in this type of analysis going forward.
library(multilevel) # To get the data
library(MASS) # To get the polr modeling function
library(arm) # To get the tools, insight and expertise of Andrew Gelman and his team
# The data
weight <- runif(length(bhr2000$GRP),min=1,max=10)
bhr2000 <- data.frame(bhr2000,weight)
head(bhr2000)
# The model
m <- polr(factor(AF06) ~ AP34 + factor(GRP),weights = weight, data = bhr2000, Hess=TRUE, method = "logistic")
summary(m)
coefplot(m,cex.var=.6) # from the arm package
Check out the lfe package---it does econ style fixed effects and you can specify clustering.

Difficulty Comparing Clusters in R - Inconsistant Labeling

I am attempting to run a monte carlo simulation that compares two different clustering techniques. The following code generates a dataset according to random clustering and then applies two clustering techniques (kmeans and sparse k means).
My issue is that these three techniques use different labels for their clusters. For example, what I call cluster 1, kmeans might call it cluster 2 and sparse k means might call it cluster 3. When I regenerate and re-run, the differences in labeling do not appear to be consistent. Sometimes the labels agree, sometimes they do not.
Can anyone provide a way to 'standardize' these labels so I can run n iterations of the simulation without having to manually resolve labeling differences each time?
My code:
library(sparcl)
library(flexclust)
x.generate=function(n,p,q,mu){
c=sample(c(1,2,3),n,replace=TRUE)
x=matrix(rnorm(p*n),nrow=n)
for(i in 1:n){
if(c[i]==1){
for(j in 1:q){
x[i,j]=rnorm(1,mu,1)
}
}
if(c[i]==2){
for(j in 1:q){
x[i,j]=rnorm(1,-mu,1)
}
}
}
return(list('sample'=x,'clusters'=c))
}
x=x.generate(20,50,50,1)
w=KMeansSparseCluster.permute(x$sample,K=3,silent=TRUE)
kms.out = KMeansSparseCluster(x$sample,K=3,wbounds=w$bestw,silent=TRUE)
km.out = kmeans(x$sample,3)
tabs=table(x$clusters,kms.out$Cs)
tab=table(x$clusters,km.out$cluster)
CER=1-randIndex(tab)
Sample output of x$clusters, km.out$cluster, kms.out$Cs
> x$clusters
[1] 3 2 2 2 1 1 2 2 3 2 1 1 3 1 1 3 2 2 3 1
> km.out$cluster
[1] 3 1 1 1 2 2 1 1 3 1 2 2 3 2 2 3 1 1 3 2
> km.out$Cs
[1] 1 2 2 2 3 3 2 2 1 2 3 3 1 3 3 1 2 2 1 3
One of the most used criterion of similarity is the Jaccard distance See for instance Ben-Hur, A.
Elissee, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered
data. Pacific Symposium on Biocomputing (pp.6--17).
Others include
Fowlkes, E. B., & Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association , 78 , 553--569
Hubert, L., & Arabie, P . (1985). Comparing partitions. Journal of Classification , 2 , 193--218.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the Americ an Statistical Association , 66 , 846--850
As #Joran points out, the clusters are nominal, and thus do not have an order per se.
Here are 2 heuristics that come to my mind:
Starting from the tables you calculate already: when the clusters are well aligned, the trace of the tab matrix is maximal.
If the number of clusters is small, you could find the maximum by trying all permutations of 1 : n of method 2 against the $n$ clusters of method 1. If it is too large, you may go with a heuristic that first puts the biggest match onto the diagonal and so on.
Similarly, the trace of the distance matrix between the centroids of the 2 methods should be minimal.
K-means is a randomized algorithm. You must expect them to be randomly ordered, actually.
That is why the established evaluation methods for clusters (read the Wikipedia article on clustering, in particular the section on "external validation") do not assume that there is a one-on-one mapping of clusters.
Even worse, one clustering algorithm may find 3 clusters, another one may find 4 clusters.
There are also hierarchical clustering algorithms. There each object can belong to many clusters, as clusters can be nested in each other.
Also some algorithms such as DBSCAN have a notion of "noise": These objects do not belong to any cluster.
I would not recommend the Jaccard distance (even though it is famous and well established) as it is hugely influenced by cluster sizes. This is due to the fact that it counts node pairs rather than nodes. I also find the methods with a statistical flavour to be missing the point. The point is that the space of partitions (clusterings) have a beautiful lattice structure. Two distances that work beautifully within that structure are the Variation of Information (VI) distance and the split/join distance. See also this answer on stackexchange:
https://stats.stackexchange.com/questions/24961/comparing-clusterings-rand-index-vs-variation-of-information/25001#25001
It includes examples of all three distances discussed here (Jaccard, VI, split/join).

Resources