How do you properly standardize a variable that was collected with quota sampling? Let me explain.
I am working with survey data that was collected with quota sampling. Each quota is a different village (1,500 in total). The questionnaire was administered to 10% of each village's population. The villages vary a lot in size: from tens of thousands of inhabitants down to a mere few hundred.
I am working with logit models and want to standardize one of my data frame's columns. Should I standardize it as is? Or would the population imbalance between the villages bias my standardized variable? Should I include population weights?
To illustrate with data, let's imagine there are only two villages (village 1 is big and village 2 is small). This is how the data would look:
total1 <- data.frame(response1 = c(0.4, -0.1, 2.1, 0.08, 0, -2.5),
                     village.number = c(1, 1, 1, 1, 2, 2))
The question stands: how do I standardize response1 when observations from village 1 outnumber those from village 2 two to one?
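For concreteness, this is the kind of population-weighted standardization I have in mind; the weights below are placeholders, not my real population figures:

w <- ifelse(total1$village.number == 1, 1, 2)             # hypothetical weights
wm <- weighted.mean(total1$response1, w)                  # weighted mean
wsd <- sqrt(weighted.mean((total1$response1 - wm)^2, w))  # weighted SD
total1$response1.std <- (total1$response1 - wm) / wsd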
Thank you.
The question may be better suited for a board like Cross Validated, but I am asking here to elicit some input.
I'm trying to construct a crude measure to gauge the similarity between any pair of objects across multiple dimensions or categories (for example, percentages of GDP across economic sectors, or students' grades in multiple subjects).
Some potential candidates I have in mind are the latent topics approach from LDA (Latent Dirichlet Allocation), which assigns (non-zero) probabilities to each unit across a list of K clusters, and word2vec, which measures the similarity between any two corpora based on the vectorized scores of their texts. But given that the objects I want to deal with usually have a fixed number of categories (e.g., academic subjects, economic sectors) and a bounded distribution (say between 0 and 100), I wonder what would be a more appropriate measure for this task? A measure between 0 and 1 would be ideal.
Also, I want to do this in a pairwise manner, so that for each unit from a total of N units, the similarity measure is calculated against each of the remaining N-1 units. For example, s11 (which is just 1), s12, s13, s14 may differ from s21, s22, s23, s24, and so forth. Eventually, I want to arrange the results into an N-by-N matrix for further processing.
I provide export statistics (4 main commodity categories from the WTO database) below as an example, hoping to use it to (1) construct a crude measure of trade (export) profile similarity between any pair of countries and (2) arrange the output into a 4-by-4 matrix.
profile = data.frame("country" = c("Afghanistan", "Albania", "Belgium", "Canada"),
                     "Agricultural products" = c(65.8, 11, 10.9, 15.3),
                     "Manufactures" = c(5.9, 69.7, 75.7, 47.9),
                     "Fuels and mining products" = c(1, 19.2, 12.6, 29),
                     "Others" = c(27.3, 0.7, 0.9, 7.8))
I hope someone can share their insights with me.
LDA is not the droid you are looking for here. If you just have vector data that you want to make pairwise comparisons for, a good place to start would be cosine similarity. As long as your data isn't too high-dimensional, cosine similarity will enable you to find pairs of countries, for instance, that have similar trading habits.
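For instance, here is a minimal base-R sketch using the profile data above (note that data.frame converts the spaced column names to dotted ones, so dropping the country column by position is safest):

m <- as.matrix(profile[, -1])            # numeric matrix, one row per country
rownames(m) <- profile$country
norms <- sqrt(rowSums(m^2))              # Euclidean length of each row
sim <- (m %*% t(m)) / (norms %o% norms)  # pairwise cosine similarity
round(sim, 3)                            # 4 x 4 matrix; diagonal is 1

Since the shares are all non-negative, the cosine values fall between 0 and 1, which matches the scale you asked for.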
I am trying to estimate the probability that a family with two children has two girls, using the rbinom function. The trouble is that rbinom does not return a probability; it just returns random values. How can I get a probability from it?
The usage of rbinom is
rbinom(
  n,    # The number of random draws (or iterations, if you like)
  size, # The number of trials per draw (here, 2 children per family)
  prob  # The probability of "success" on each trial, for your definition of success
)
If you assume the probability of having a girl is 50% (it isn't quite: there are a number of confounding variables, and the actual birth ratio of girls to boys is closer to 100:105), then you can simulate the outcome of one "trial" with a sample size of 2 children.
rbinom(1, 2, 0.5)
You will get a random outcome of 0, 1, or 2 girls. This alone does not give you the probability that both are girls; for that you have to complete multiple "trials".
Here is an example with n = 10. I am using set.seed to provide a specific initial state to the RNG to make the results reproducible.
set.seed(100)
num_girls <- rbinom(10, 2, 0.5)
You can do a little trick to find out how many times there are two girls.
sum(num_girls == 2) # Does the comparison and adds; TRUE = 1 and FALSE = 0
[1] 1
So you get two girls in 1 out of 10 trials. Ten trials are not enough to approach the true probability yet, but you should get the idea.
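If you push the number of trials up, the empirical share converges on the exact value, which (as a side note) dbinom gives you directly:

set.seed(100)
mean(rbinom(100000, 2, 0.5) == 2) # empirical estimate, close to 0.25
dbinom(2, 2, 0.5)                 # exact probability: 0.5^2 = 0.25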
The code below is about oversampling houses with more than 10 rooms. May I ask what prob = ifelse(housing.df$ROOMS > 10, 0.9, 0.01) means? Thanks a lot.
s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS > 10, 0.9, 0.01))
housing.df[s, ]
I imagine the purpose of this code is to first check whether a given house in the data set has more than ten rooms. If that is the case, it gets a sampling probability of 90%; otherwise it gets a probability of 1%.
sample will then draw from the row names using these associated probabilities, thus favouring houses with more than ten rooms when it samples. (The weights do not need to sum to 1; sample rescales them internally.) This creates your oversample.
Is this what you mean?
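To see it in action, here is a toy sketch; housing.df is not shown in the question, so this small data frame is made up:

housing.df <- data.frame(ROOMS = c(3, 5, 12, 4, 11, 6, 15, 2)) # invented data
set.seed(1)
w <- ifelse(housing.df$ROOMS > 10, 0.9, 0.01) # heavy weight on >10-room houses
s <- sample(row.names(housing.df), 5, prob = w)
housing.df[s, , drop = FALSE]                 # the three large houses almost always appear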
I seek guidance on R code to make a probability calculation based on survey data. The survey asked respondents to select from five descriptions of how often they sought advice (the "frequency" variable in the data frame below). The number of respondents choosing each description is the "respondents" variable, and the maximum annual number I have assumed for each frequency is in the "max.year.est" variable.
use <- data.frame(frequency = c("rarely; less than once per quarter",
                                "very occasionally; about every other month",
                                "occasionally; about once per month",
                                "fairly frequently; more than once per month",
                                "very frequently; once a week or more"),
                  respondents = c(50, 40, 30, 20, 10),
                  max.year.est = c(3, 6, 12, 18, 78))
The dplyr call below adds three columns to the use data frame, each an annual total of requests for advice by the respondent group, obtained by multiplying the respondents (i) by the assumed maximum for each range (which requires assumptions about the top of the two most frequent ranges); (ii) by the mid-point of max.requests, a more moderate intermediate assumption; and (iii) by 40% of max.requests, a figure pulled out of a hat because it seems reasonable that lower numbers of requests will be more common in each range than higher numbers.
use %>%
  group_by(frequency) %>%
  mutate(max.requests = respondents * max.year.est,
         mean.requests = 0.5 * max.requests,
         lower.requests = 0.4 * max.requests)
If we assume the annual requests for advice in each range follow a reasonable pattern (more respondents making requests at the low end of a range, fewer as you move up), what statistical method and R code (a Poisson distribution?) would arrive at a defensible total of annual requests in each range, given the assumptions above?
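To make the "more requests at the low end" idea concrete, here is a rough sketch of the kind of weighting I imagine; the geometric-style decay and the 0.7 rate are placeholders with no empirical basis:

# Weight each possible annual count 1..max by a decaying weight and take
# the weighted mean count per range (the decay rate is hypothetical).
decayed.mean <- function(max.count, decay = 0.7) {
  k <- seq_len(max.count)
  w <- decay^k
  sum(k * w) / sum(w)
}
use$decayed.requests <- use$respondents * sapply(use$max.year.est, decayed.mean)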
Thank you for your comments and answers.
I am looking for the best way, or the best package available, to simulate a genetic association between a specific SNP and a quantitative phenotype, with the simulated data as similar as possible to my real data, except that I know the causal variant.
All of the packages I have seen in R seem to be specialised in pedigree data or in population data where coalescence and other evolutionary factors are specified, but I don't have any experience in population genetics, and I only want to simulate the simple case of a European population with characteristics similar to my real data (i.e. a normal distribution for the trait, an additive effect for the genotype, similar allele frequencies, and so on).
So for example if my genetic data is X and my quantitative variable is Y:
X <- rbinom(1000, 2, 0.4)
Y <- rnorm(1000, 1, 0.4)
I am looking for something in R similar to the function in Plink where one specifies a range of allele frequencies, a range for the phenotype, and a specific variant which should come out associated with the phenotype (this is important because I need to repeat these associations in different datasets with the same causal variant).
Can someone please help me?
If the genotype changes only the mean of the phenotype, this is very simple.
phenotype.means <- c(5, 15, 20) # phenotype means for genotypes 0, 1, and 2
phenotype.sd <- 5
X <- rbinom(1000,2,0.4)
Y <- rnorm(1000, phenotype.means[X + 1], phenotype.sd) # X + 1 because X is 0/1/2 and R indexing is 1-based
This will give you Y, a vector of 1000 normally distributed values, where individuals with the homozygous recessive genotype (aa, coded 0) have a mean of 5, heterozygotes (Aa, coded 1) have a mean of 15, and those with the homozygous dominant genotype (AA, coded 2) have a mean of 20.
If you want a more traditional two-level phenotype (AA/Aa versus aa, i.e. a dominant model), just set phenotype.means to something like c(5, 20, 20).
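As a quick sanity check (my addition, not something the simulation requires), you can confirm the simulated association is recoverable with a simple linear model:

phenotype.means <- c(5, 15, 20)
phenotype.sd <- 5
set.seed(42)                     # arbitrary seed for reproducibility
X <- rbinom(1000, 2, 0.4)
Y <- rnorm(1000, phenotype.means[X + 1], phenotype.sd)
summary(lm(Y ~ X))               # strongly significant positive per-allele effect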