I need to assign weights to a sample for a country. I already have the population for each of the 85 regions, but I cannot get the cluster sampling to work. Basically, I need to create 100 clusters of 15 units each, for 1,500 respondents overall. I have an Excel file with all the variables for the 85 regions.
Question 1:
How can I use the already generated population probability to do a weighted randomization for 100 clusters (with 15 units each)?
Question 2:
I need to draw from the 85 regions and generate 100 clusters. Logically, the capital and some of the other big cities should get more than one cluster, because their larger populations give them a higher probability of being selected. How can I draw the clusters (15 units each) and assign a number of clusters to each region? For instance, if the capital's cluster probability is 0.08 (8 percent), then 8 of the 100 clusters (15 units each) should be assigned to the capital. How do I add that column?
Specifically, the problem with my current results is that I cannot generate the column with the number of clusters per region: for instance, region A should have 3 clusters, region B 1, and so forth.
Here is my code:
# Selection probability of each region, proportional to its population
data1$clusProb = data1$Population.2018 / sum(data1$Population.2018)
# Every region once, plus 14 extra draws per region weighted by population
sampInd = c(1:length(data1$Federal.Subject), sample(1:length(data1$Federal.Subject), length(data1$Federal.Subject)*14, prob = data1$clusProb, replace = TRUE))
sampFields = data.frame(id = 1:(length(data1$Federal.Subject)*15), Gender = sample(c(0,1), length(data1$Federal.Subject)*15, replace = TRUE))
sampleData = cbind(data1[sampInd,],sampFields)
sampleData
summary(sampleData)
The result should look like:
Cluster number Region
1 A
2 A
3 A
4 C
5 D
6
NOTE: A represents a region with a higher population, which should therefore have more clusters assigned to it.
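One possible way to get the clusters-per-region column is to allocate the 100 clusters with a single multinomial draw on the population shares and then expand the counts into one row per cluster. This is only a sketch: the populations below are made up, and the columns nClusters, Cluster, Region and Unit are names introduced here for illustration. Federal.Subject and Population.2018 are the column names from the question.

```r
set.seed(123)

# Made-up stand-in for the Excel data: 85 regions with hypothetical populations
data1 <- data.frame(
  Federal.Subject = paste0("Region_", 1:85),
  Population.2018 = round(runif(85, 5e4, 1.2e7))
)

# Selection probability proportional to population
data1$clusProb <- data1$Population.2018 / sum(data1$Population.2018)

# Allocate 100 clusters across the regions in one multinomial draw
data1$nClusters <- as.vector(rmultinom(1, size = 100, prob = data1$clusProb))

# One row per cluster, with the region repeated nClusters times
clusters <- data.frame(
  Cluster = 1:100,
  Region  = rep(data1$Federal.Subject, times = data1$nClusters)
)

# Expand to 15 units per cluster (1500 rows in total)
sampleData <- clusters[rep(1:100, each = 15), ]
sampleData$Unit <- rep(1:15, times = 100)
rownames(sampleData) <- NULL

head(clusters)
```

Because the allocation is random, a small region can end up with zero clusters. If a fixed proportional allocation is wanted instead, rounding 100 * clusProb is an alternative, although the rounded counts then need adjusting so that they sum to 100.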
I aim to produce a typology of sites through hierarchical clustering of species abundance data. To do so, I successively cut the dendrogram into 2, 3, 4 ... z groups.
The cluster group names are automatically attributed by the function cutree() representing numbers from 1 to z in a non-consistent manner. For instance, in a clustering with three groups, "group 2" may not correspond to "group 2" in a clustering with six groups. This makes interpreting the dendrogram very difficult.
The code below provides a reproducible example. It produces a hierarchical clustering of 50 observations and successively cuts the dendrogram in a for loop. The final output data frame 'cluster.grps' contains the cluster group affiliation for each observation at each successive cut (HC_2 = hierarchical clustering with 2 groups; HC_3 = hierarchical clustering with 3 groups; etc.).
set.seed(1)
data <- data.frame(replicate(10,sample(0:10,50,rep=TRUE))) # create random site x species dataframe
clust <- hclust(dist(data), method = "ward.D") # implement hierarchical clustering
# Set maximum number of groups
z <- 6
# Loop for successive tree cutting
lst <- list()
for (i in 2:z) {
  # Slice the dendrogram into i groups (k = number of groups)
  grp <- cutree(clust, k = i)
  lst[[(i-1)]] <- grp
}
names(lst) <- 2:z
cluster.grps <- as.data.frame(lst)
colnames(cluster.grps) <- paste("HC",as.character(2:(z)),sep ="_")
I now wish to attribute dichotomous names that represent the level of hierarchy in the tree: 1, 2 for the first level; 1.1, 1.2, 2.1, 2.2 for the second level; 1.1.1, 1.1.2, 1.2.1, 1.2.2, etc. for the third level and so on.
Ideally, the table 'cluster.grps' would look like this:
Site    HC_2  HC_3  HC_4
Site 1  1     1.1   1.1
Site 2  2     2     2.1
Site 3  1     1.2   1.2
Site 4  2     2     2.2
My first thought was to code nested clusterings: start with a first clustering of all observations into two groups, then split each of those groups independently into two further groups, yielding four groups at the second hierarchical level. This requires quite long code, though, and I was wondering whether there might be a more elegant way.
Any thoughts?
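One possible approach, sketched here rather than offered as a definitive solution, relies on the fact that going from k-1 to k groups splits exactly one existing group in two. Calling cutree() with a vector of k values returns all cuts at once, so the labels can be built level by level: a group that is split gets ".1" and ".2" appended to its parent label, and an unsplit group keeps its label, as in the desired table. The names cuts and labs are introduced here for illustration.

```r
set.seed(1)
data <- data.frame(replicate(10, sample(0:10, 50, replace = TRUE)))
clust <- hclust(dist(data), method = "ward.D")
z <- 6

# One column per number of groups (k = 1, ..., z)
cuts <- cutree(clust, k = 1:z)

# Hierarchical labels; the root (k = 1) has an empty label
labs <- matrix("", nrow = nrow(cuts), ncol = z)

for (k in 2:z) {
  labs[, k] <- labs[, k - 1]                 # unsplit groups keep their label
  for (parent in unique(cuts[, k - 1])) {
    idx <- cuts[, k - 1] == parent
    children <- unique(cuts[idx, k])
    if (length(children) == 2) {             # this is the group that was split
      prefix <- labs[idx, k - 1][1]
      suffix <- match(cuts[idx, k], children)
      labs[idx, k] <- if (prefix == "") {
        as.character(suffix)
      } else {
        paste(prefix, suffix, sep = ".")
      }
    }
  }
}

cluster.grps <- as.data.frame(labs[, -1, drop = FALSE], stringsAsFactors = FALSE)
colnames(cluster.grps) <- paste("HC", 2:z, sep = "_")
head(cluster.grps)
```

With this scheme every label at one level starts with the observation's label at the previous level, so "1.2" in HC_4 can always be traced back to group "1" in HC_2.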
I am trying to run a nested anova in R.
I have 3 unique factors: Vegetation type, transect number, and distance.
I am trying to determine if humidity differs among vegetation type.
There are three transects in each of the three vegetation types (labelled 1-9). Across each transect are 8 distances (range from 50 m to 400 m). At each of these distances, humidity was measured (e.g., at 50 m, measure humidity, at 100 m measure humidity).
This is the code I originally tried:
nest=aov(Data$Temp_400m ~ Data$Vegetation / factor(Data$Transect))
summary(nest)
I am also wondering if I need to convert transect number and distance to categorical values (i.e., instead of transect # 1-9, it would be A-I, and instead of distance 50 - 400, it would be 50m... 400m).
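As a sketch of one common way to set this up, assuming the data are in long format with one row per humidity measurement: numeric codes do not need to be relabelled as letters, since factor() is enough to make R treat them as categories. The data frame d, its simulated values and the column name Humidity are all hypothetical. Transect is used as the error stratum so that vegetation is tested against transect-to-transect variation, and distance is left out of this minimal model.

```r
set.seed(1)

# Simulated long-format data: 9 transects x 8 distances (hypothetical values)
d <- expand.grid(Distance = seq(50, 400, by = 50), Transect = 1:9)
d$Vegetation <- rep(c("Forest", "Shrub", "Grass"), each = 3)[d$Transect]
d$Humidity <- rnorm(nrow(d), mean = 60, sd = 5)

# Treat the numeric transect codes as categories
d$Transect <- factor(d$Transect)
d$Vegetation <- factor(d$Vegetation)

# Transects are nested within vegetation types and act as the error stratum
nest <- aov(Humidity ~ Vegetation + Error(Transect), data = d)
summary(nest)
```

Whether distance should enter as a numeric covariate or as a factor depends on the question being asked. As a factor it estimates a separate mean per distance, and as a number it fits a linear trend.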
In R, I would like to generate a multinomially distributed random number vector of a given size N, for example using rmultinom, but with a maximum size for each of the K boxes.
For example:
set.seed(1)
draw = rmultinom(n = 1, size = 1000, prob = c(67,211,264,166,144,52,2,175))
In this case, the size is 1000, specifying the total number of objects that are put into eight boxes (the length of prob), and prob = c(67,211,264,166,144,52,2,175) the vector of probabilities for the eight boxes (which is internally normalized to sum 1). In addition, I would like c(67,211,264,166,144,52,2,175) to be the vector of the maximum size for each of the eight boxes.
However in this case, it is possible to generate numbers that are higher than c(67,211,264,166,144,52,2,175) (for instance in the example above, draw[7,]=4 is higher than 2), whereas I would like each number to be lower or equal to the maximum size of each box specified in prob, in addition to draw summing to size = 1000.
Do you know any function or any simple way to do that? I was not able to find the answer.
From Wikipedia: "For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories".
The keyword here is independent. Your constraint on the number of times each category can be drawn means the sampling is not independent. If your problem were multinomial, it would be possible - though very unlikely - that all numbers could be drawn from box 7. This is not what you want, so you can't use rmultinom.
Here's a different approach:
# vector of item counts
m <- c(67,211,264,166,144,52,2,175)
# expand the item counts into a single vector with i repeated m[i] times
d <- rep(seq_along(m), times = m)
# sample from d without replacement
s <- sample(d, size=1000, replace=FALSE)
# count how many items of each type were sampled
table(factor(s))
1 2 3 4 5 6 7 8
63 197 242 153 135 48 2 160
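Sampling without replacement from the expanded vector amounts to a draw from a multivariate hypergeometric distribution. As a sketch, the same kind of draw can be produced without building the expanded vector by calling rhyper() box by box: each box is drawn conditionally on what is still left to allocate. The function name rmvhyper_seq is made up for this example.

```r
# Hypothetical helper: one multivariate hypergeometric draw via sequential rhyper()
rmvhyper_seq <- function(m, size) {
  stopifnot(size <= sum(m))
  K <- length(m)
  x <- integer(K)
  remaining <- size
  for (i in seq_len(K - 1)) {
    # box i has m[i] items; all later boxes together hold sum(m[(i + 1):K])
    x[i] <- rhyper(1, m[i], sum(m[(i + 1):K]), remaining)
    remaining <- remaining - x[i]
  }
  x[K] <- remaining   # whatever is left goes to the last box
  x
}

set.seed(1)
m <- c(67, 211, 264, 166, 144, 52, 2, 175)
draw <- rmvhyper_seq(m, 1000)
draw
```

Each count stays at or below its box maximum and the counts sum to size, which is the constraint asked for in the question.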
I have data for 100 households that were randomly sampled from a larger community. I would now like to resample 10,000 grabs of 20 households from the original 100 household sample using different cluster sampling methods (i.e. 10 clusters of 2 households, 5 clusters of 4 households). Each cluster would consist of a randomly selected observation and the n observations immediately following it. Ex. For 10 clusters of 2 households, each cluster would consist of a randomly selected household and the household that immediately follows it in the dataset. For 5 clusters of 4 households, each cluster would consist of a randomly selected household and the 3 households that immediately follow it in the dataset.
I have been able to achieve the desired resampling output for 10,000 grabs of 20 households using simple random sampling with the following:
dat <- data.frame(hh_id = c(1:100), var = sample(1:200, 100, replace = T))
rs <- NULL
for(i in 1:10000){rs[i] = list(dat[sample(nrow(dat), 20, replace=TRUE),])}
How would I achieve the same output, but randomly selecting 10 clusters of 2 households (i.e. 20 households total per grab) instead of using simple random sampling? I have looked at the infer, sample, and resample packages, as well as others, and looked thoroughly through other posts here, but can't find a solution that applies.
Ultimately, I will compare the variance of each sampling method from the 100 household mean to find a balance between accuracy and efficiency. If there is a shortcut to bootstrap all this directly, I would also be interested in that.
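A minimal sketch of the cluster version, under two assumptions that the question leaves open: cluster start rows are drawn with replacement (so clusters within a grab may overlap), and start rows are restricted so that a cluster never runs past the last row of the dataset. The function name cluster_grab is made up for this example.

```r
set.seed(42)
dat <- data.frame(hh_id = 1:100, var = sample(1:200, 100, replace = TRUE))

# Hypothetical helper: draw n_clusters start rows and take each start row
# plus the (cluster_size - 1) rows that immediately follow it
cluster_grab <- function(dat, n_clusters, cluster_size) {
  starts <- sample(nrow(dat) - cluster_size + 1, n_clusters, replace = TRUE)
  # column j of the outer() matrix holds the row indices of cluster j
  idx <- as.vector(outer(0:(cluster_size - 1), starts, "+"))
  dat[idx, ]
}

# 10,000 grabs of 10 clusters of 2 households (20 households per grab)
rs <- replicate(10000, cluster_grab(dat, 10, 2), simplify = FALSE)

# Variance of the grab means around the mean of the 100 households
grab_means <- sapply(rs, function(g) mean(g$var))
mean((grab_means - mean(dat$var))^2)
```

Calling cluster_grab(dat, 5, 4) gives the 5 clusters of 4 households design, so the same last two lines can be reused to compare the sampling methods.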
I have this simple dataset. The dataset is by hypothetical geographical unit (i.e. postal code) and has 3 variables: longitude, latitude and someValue (sales).
lon<-rep(1:10,each=10)
lat<-rep(1:10,10)
someValue<-rnorm(100, mean = 20, sd = 5)
dataset<-data.frame(lon,lat,someValue)
The problem I’m facing is territory alignment. Given a proposed number of territories, I need to group postal codes into territories in such a way that each territory consists of adjacent postal codes and the sum of someValue is roughly the same across territories (+/- 15% of the average for the specified number of territories).
The best idea I have at this point is to: 1. do clustering on lon/lat first to establish candidates; 2. do clustering on someValue using centroids from step 1 as centers with iter.max=1; 3. iterate over 1 and 2 until some convergence cut-off.
I would like to ask the community: what would be a proper methodology to implement something like this in R? I did search for Spatial Clustering and was not able to find anything relevant.
You can do the clustering using kmeans by only considering the first two columns (lon and lat):
# How many clusters do you want to have initially?
initialClasses <- 2
#clustering using kmeans
initClust <- kmeans(dataset[,1:2], initialClasses, iter.max = 100)
dataset$classes <- initClust$cluster
initClust$cluster then contains your cluster classes. You can add them to your data frame and use dplyr to calculate some statistics, for example the sum of someValue per cluster:
library(dplyr)
statistics <- dataset %>% group_by(classes) %>% summarize(sum = sum(someValue))
Here for example the sum of someValue over two classes:
classes sum
(int) (dbl)
1 1 975.7783
2 2 978.9166
Let's say your data is evenly distributed and you want the sum of someValue per cluster to be smaller. Then you need to rerun the clustering with more (e.g. 3) classes:
newRun <- kmeans(dataset[,1:2], 3, iter.max = 100)
dataset$classes <- newRun$cluster
Here the output statistics for three classes:
classes sum
(int) (dbl)
1 1 577.6573
2 2 739.9668
3 3 637.0707
By wrapping this inside a loop and calculating more criteria (e.g. variance), you can tune your clustering to the right size. Hope it helps.
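Such a loop could be sketched as follows. It uses base R's tapply() instead of dplyr to stay self-contained, and it checks the +/- 15% criterion from the question for each candidate number of territories. The range 2:6, the nstart value and the column names in balance are choices made for this example. Note that kmeans on lon/lat only encourages compact territories; it does not by itself enforce the balance criterion.

```r
set.seed(1)
lon <- rep(1:10, each = 10)
lat <- rep(1:10, 10)
someValue <- rnorm(100, mean = 20, sd = 5)
dataset <- data.frame(lon, lat, someValue)

# Try several numbers of territories and summarise how balanced the sums are
results <- lapply(2:6, function(k) {
  cl <- kmeans(dataset[, c("lon", "lat")], centers = k, iter.max = 100, nstart = 10)
  sums <- tapply(dataset$someValue, cl$cluster, sum)
  data.frame(
    k = k,
    min_sum = min(sums),
    max_sum = max(sums),
    # TRUE when every territory is within +/- 15% of the average territory sum
    within_15pct = all(abs(sums / mean(sums) - 1) <= 0.15)
  )
})
balance <- do.call(rbind, results)
balance
```

Rows with within_15pct equal to TRUE are candidate solutions. If none qualify, the clustering can be rerun with other seeds or followed by a step that moves border postal codes between neighbouring territories.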