Using weights in R to consider the inverse of sampling probability [duplicate]

This is similar to, but not the same as, Using weights in R to consider the inverse of sampling probability.
I have a long data frame and this is a part of the real data:
age gender labour_situation industry_code FACT FACT_2....
35 M unemployed 15 1510
21 F inactive 00 651
FACT is a weight variable: the first row means that one unemployed 35-year-old male in the sample represents 1510 individuals in the population.
I need to produce some tables showing relevant information, like the % of employed and unemployed people, etc. In Stata there are options like tab labour_situation [w=FACT], which shows the number of employed and unemployed people in the population, while tab labour_situation shows the number of employed and unemployed people in the sample.
A partial solution could be to repeat the 1st row of the data frame 1510 times and then the 2nd row 651 times. From what I've searched, one option is to run
longdata <- data[rep(1:nrow(data), data$FACT), ]
employment_table = with(longdata, addmargins(table(labour_situation, useNA = "ifany")))
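If the goal is only the weighted table, a lighter alternative (a sketch, assuming the same data frame data) is to let xtabs() sum the weights directly, which avoids materialising the expanded data frame:
# A formula with a left-hand side makes xtabs() sum FACT within each
# level of labour_situation, i.e. the weighted (population) counts.
# Note: unlike table(useNA = "ifany"), xtabs() drops NA levels by default.
employment_table <- addmargins(xtabs(FACT ~ labour_situation, data = data))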
The other thing I need to do is to run a regression, bearing in mind that there was cluster sampling: the population was divided into regions. This creates a problem: an individual interviewed in region 1 represents n_1 people, while an individual interviewed in region 2 represents n_2 people, but n_1 and n_2 are not in proportion to the total population of each region, so some regions will be overrepresented and other regions will be underrepresented. In order to take this into account, each observation should be weighted by the inverse of its probability of being sampled.
The last paragraph means that the coefficients can still be estimated with the usual estimating equations, BUT the variance-covariance matrix won't be valid unless I weight each observation by the inverse of its sampling probability.
In Stata it is possible to run a regression with reg y x1 x2 [pweight=n], which calculates the correct variance-covariance matrix by considering the inverse of sampling probability. At the moment I have to use Stata for some parts of my work and R for others; I'd like to use just R.
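In R, the closest analogue (a sketch, assuming the weight column is FACT; y, x1 and x2 are the same placeholder variables as in the Stata call) is the survey package, which reproduces Stata's [pweight=] behaviour: weighted point estimates together with a design-based variance-covariance matrix:
library(survey)

# Declare the design; ids = ~1 means no cluster structure. If a region
# identifier existed (hypothetical column 'region'), ids = ~region would
# account for the cluster sampling described above.
des <- svydesign(ids = ~1, weights = ~FACT, data = data)

# Weighted tabulation, analogous to Stata's: tab labour_situation [w=FACT]
svytable(~labour_situation, design = des)

# Weighted regression, analogous to Stata's: reg y x1 x2 [pweight=n]
fit <- svyglm(y ~ x1 + x2, design = des)
summary(fit)  # standard errors account for the inverse-probability weights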

You can do this by repeating the rownames:
df1 <- df[rep(row.names(df), df$FACT), 1:5]
> head(df1)
age gender labour_situation industry_code FACT
1 35 M unemployed 15 1510
1.1 35 M unemployed 15 1510
1.2 35 M unemployed 15 1510
1.3 35 M unemployed 15 1510
1.4 35 M unemployed 15 1510
1.5 35 M unemployed 15 1510
> tail(df1)
age gender labour_situation industry_code FACT
2.781 21 F inactive 0 787
2.782 21 F inactive 0 787
2.783 21 F inactive 0 787
2.784 21 F inactive 0 787
2.785 21 F inactive 0 787
2.786 21 F inactive 0 787
Here 1:5 refers to the columns to keep. If you leave that part blank, all columns will be returned.
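For completeness, tidyr offers the same row expansion as a one-liner (a sketch, assuming tidyr is installed):
library(tidyr)
df1 <- uncount(df, FACT)  # repeats each row FACT times; drops the FACT column by default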

Related

R: Create sample with at least one element from each category

For linear regression to predict house prices, I need to split the data into train and test samples in an 80/20 proportion.
However, some of the variables are factors in which a few levels have just one observation.
Because of this, when performing random sampling, those levels end up in the test sample but not in the train sample.
Hence, when predicting the sale price on the test set, this error comes up:
"Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor Exterior1st has new levels ImStucc"
Here is the summary of the train sample of Exterior1st variable:
> summary(train$Exterior1st)
AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc MetalSd Plywood Stone Stucco
11 0 1 36 0 41 173 0 164 78 2 17
VinylSd Wd Sdng WdShing
389 140 17
Here is summary of the test sample of Exterior1st variable:
> summary(test$Exterior1st)
AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc MetalSd Plywood Stone Stucco
4 0 0 8 1 11 37 1 37 22 0 4
VinylSd Wd Sdng WdShing
97 43 3
As you can see, the ImStucc level of this variable is present in the test sample but not in the train sample, which is why the predict function throws the error mentioned above.
While searching for a solution, I came across a function called "stratified", but I could not get it to work in R.
There was another solution using dplyr's group_by, but there one has to specify the number of observations for each group. That is not suitable for this dataset, as it would require a calculation for each factor level.
Another solution I found samples a vector alone, not the data frame, so it does not help either:
t= sample(c(filtered_data$Exterior1st,sample(filtered_data$Exterior1st,size = 1000, replace = TRUE)))
> table(t)
t
1 3 4 5 6 7 8 9 10 11 12 13 14 15
26 2 74 2 91 375 1 345 168 3 37 848 329 36
The above sampling gives a total of 2337 entries, even though the size given is 1000 (sample() without a size argument simply permutes the whole concatenated vector), so this is probably not what I'm looking for.
Is there a method to create a sample of 80% of the data such that at least one observation of each factor level is present in the sample?
If there isn't, what is the workaround for this situation?
Maybe I am misreading, but if you only have one observation of a factor level, you won't be able to use that level, ImStucc, in a regression.
I would remove that variable from the model, collect more data, or aggregate the level with other levels (if possible). (I would probably not include 2, 5, or 11 either, from the table t, because they also have few observations.)
Also, sample() with replace = TRUE will choose the same observation multiple times. Set replace = FALSE to avoid duplication of entries.
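If you do want every level represented in the train set, one workaround (a base R sketch, assuming the data frame is called filtered_data as in your code) is to reserve one random row per level first and then fill the 80% quota from the rest:
set.seed(42)  # reproducibility

# One guaranteed row per (non-empty) level of Exterior1st
grp    <- split(seq_len(nrow(filtered_data)), filtered_data$Exterior1st)
grp    <- grp[lengths(grp) > 0]
anchor <- vapply(grp, function(idx) idx[sample.int(length(idx), 1)], integer(1))

# Fill the remainder of the 80% quota from the other rows
n_train   <- floor(0.8 * nrow(filtered_data))
rest      <- setdiff(seq_len(nrow(filtered_data)), anchor)
train_idx <- c(anchor, sample(rest, n_train - length(anchor)))

train <- filtered_data[train_idx, ]
test  <- filtered_data[-train_idx, ]
Note that single-observation levels then never appear in the test set, which is another reason to merge or drop them as suggested above.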

Issue in creating contingency table in R

I am using the ISLR package for my statistics practice, with the OJ dataset. I am trying to create a contingency table from the Purchase column and the special price columns (SpecialCH and SpecialMM) for each of the brands.
I am trying to find the likelihood of CH being sold if there is a special price.
Here is my code so far.
library(ISLR)
CH <- table(OJ[OJ$Purchase == 'CH', "SpecialCH"])
MM <- table(OJ[OJ$Purchase == 'MM', "SpecialMM"])
table(MM, CH)
The output I get is a bit weird.
CH
MM 121 532
101 1 0
316 0 1
I am trying to find the odds ratio and eventually apply McNemar's test, but I am unable to generate the contingency table. I can do it by hand but need to do it in R.
You are trying to work with 3 variables, but a contingency table only uses 2. I recommend using xtabs since the formula method saves some typing and it does a better job of labeling the table:
xtabs(~SpecialMM+SpecialCH, OJ) # Only 4 weeks are both on special
# SpecialCH
# SpecialMM 0 1
# 0 743 154
# 1 169 4
xtabs(~Purchase+SpecialCH, OJ) # When CH is on special ca 75% CH
# SpecialCH
# Purchase 0 1
# CH 532 121
# MM 380 37
# xtabs(~Purchase+SpecialMM, OJ) # When MM is on special ca 58% MM
# SpecialMM
# Purchase 0 1
# CH 581 72
# MM 316 101
The first table asks the question: are specials for one brand associated with specials for the other brand? There are 1070 purchases of OJ represented. CH was on special 158 times and MM was on special 173 times, but only 4 times were both brands on special. This table suggests that MM and CH are not put on special at the same time. You could use a chi-square or another test to see if that is a significant deviation from random assignment of specials.
The second and third tables look at purchases of OJ to see whether one brand is more likely to be purchased, relative to the other, when it is on sale. Notice that most OJ purchases occur when neither brand is on sale, but it could be that specials boost the purchase of the brand on sale. Again, statistical tests would tell you whether this is just random chance.
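To get the odds ratio the question asks about, the second table can be used directly (a sketch; chisq.test() is the independence test mentioned above, while mcnemar.test() would only be appropriate for genuinely paired proportions):
tab <- xtabs(~Purchase + SpecialCH, OJ)

# Odds of buying CH when CH is on special vs. when it is not:
# (121/37) / (532/380)
or <- (tab["CH", "1"] * tab["MM", "0"]) / (tab["CH", "0"] * tab["MM", "1"])
or               # roughly 2.3

chisq.test(tab)  # is purchase associated with the CH special?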

Observations with low frequency all go into the train set and produce an error in predict()

I have a dataset (~14410 rows) with observations that include a country variable. I split this set into train and test sets and train a decision tree with the rpart() function. When predicting, I sometimes get the error that the test set has countries which are not in the train set.
At first I excluded/deleted the countries which appeared only once:
# Get order countries with frequency one
var.names <- names(table(mydata1$country))[table(mydata1$country) == 1]
loss <- match(var.names, mydata1$country)  # row index of each single-occurrence country
names(which(table(mydata1$country) == 1))  # prints the same country names, as a check
mydata1 <- mydata1[-loss, ]
When rerunning my code, I get the same error at the same code line, saying that I have new countries in test which are not in train.
Now I did a count to see how often a country appears.
count <- as.data.frame(count(mydata1, vars=mydata1$country))
count[rev(order(count$n)),]
vars n
3 Bundesrep. Deutschland 7616
9 Grossbritannien 1436
12 Italien 930
2 Belgien 731
22 Schweden 611
23 Schweiz 590
13 Japan 587
19 Oesterreich 449
17 Niederlande 354
8 Frankreich 276
18 Norwegen 238
7 Finnland 130
21 Portugal 105
5 Daenemark 65
26 Spanien 57
4 China 55
20 Polen 51
27 Taiwan 31
14 Korea Süd 30
11 Irland 26
29 Tschechien 13
16 Litauen 9
10 Hong Kong 7
30 <NA> 3
6 Estland 3
24 Serbien 2
1 Australien 2
28 Thailand 1
25 Singapur 1
15 Kroatien 1
From this I can see that I also have NAs in my data.
My question now is: how can I proceed with this problem?
Should I exclude/delete all countries with, e.g., fewer than 7 observations, or should I take the rows of those countries and repeat them two times, so that my predict() function will always work, also for other data sets?
It somehow isn't "fancy" just to delete the rows... is there any other possibility?
You need to convert every character variable to a factor:
mydata1$country <- as.factor(mydata1$country)
Then you can simply proceed with the train/test split. You won't need to remove anything (except the NAs).
By using the factor type, your model will know that the country variable has a fixed set of possible levels:
Example:
country <- factor("Italy", levels = c("Italy", "USA", "UK")) # just 3 levels for example
country
[1] Italy
Levels: Italy USA UK
# note that as.factor() takes care of defining the levels for you
See the difference with:
country <- "Italy"
country
[1] "Italy"
By using a factor, the model knows all the possible levels. Because of this, even if the train data happens to contain no "Italy" observation, the model will know that this level can occur in the test data.
factor is always the correct type for characters in models.
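Concretely, the workflow could look like this (a sketch, where y stands for a hypothetical outcome column; the key point is that subsetting a factor keeps its full level set):
library(rpart)

mydata1$country <- as.factor(mydata1$country)
mydata1 <- mydata1[!is.na(mydata1$country), ]  # drop the NA countries

set.seed(1)
idx   <- sample(nrow(mydata1), floor(0.8 * nrow(mydata1)))
train <- mydata1[idx, ]
test  <- mydata1[-idx, ]

# Both subsets inherit every country level, so predict() never sees a
# "new" country
identical(levels(train$country), levels(test$country))  # TRUE

fit  <- rpart(y ~ country, data = train)  # y is a hypothetical outcome
pred <- predict(fit, newdata = test)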

R: Compare magnitude of response over time among many individuals

I have 1000 individuals with measurements of plasma HDL levels over the last 15 years. The number of measurements per individual ranges from 1 to 15, and they were not taken on the same dates. I would like to compare the HDL levels over time across individuals and rank those whose HDL levels are consistently the highest over time. There must be some sort of linear regression method for measuring this effect, but I am not acquainted with it. One concern is the minimum number of measurements per individual that I should require in order to include or exclude individuals from the analysis.
I have a data.frame with 4 columns: Individual_ID, HDL_level, Age (at the time of measurement) and Sex. This is a data.frame with a toy example:
Individual_ID HDL_level Age (years old) Sex
1 50 12.3 M
1 52.1 15.4 M
1 55.3 17.1 M
2 45 12.1 M
2 46.3 12.1 M
3 60 14.3 F
3 55 16.2 F
... ... ... ...
I would like to produce 2 rankings using R: 1) males and females considered together, with NO adjustment for sex; and 2) males and females considered together, adjusting for sex. In both cases, I would also like to adjust for diet (e.g., whether or not they ingested food during the 3 hours before the measurement was taken). Thank you in advance.
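One common approach (a sketch only, not a definitive answer; it assumes the lme4 package, a data frame hdl_data with the columns from the toy example, and a hypothetical fasting indicator for the diet adjustment) is a mixed-effects model with a random intercept per individual, ranking individuals by their estimated intercepts:
library(lme4)

# Ranking 2: adjusted for sex (and diet); drop Sex from the formula for ranking 1
fit <- lmer(HDL_level ~ Age + Sex + fasting + (1 | Individual_ID), data = hdl_data)

# Random intercepts measure how far each individual sits above or below
# the population trend after the fixed-effect adjustments
re      <- ranef(fit)$Individual_ID
ranking <- data.frame(id = rownames(re), intercept = re[["(Intercept)"]])
ranking <- ranking[order(-ranking$intercept), ]
head(ranking)  # consistently high-HDL individuals first
Individuals with a single measurement still receive a (heavily shrunken) random intercept, which partly sidesteps the minimum-number-of-measurements concern.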

'Forward' cumulative sum in dplyr

When examining datasets from longitudinal studies, I commonly get results like this from a dplyr analysis chain from the raw data:
df = data.frame(n_sessions=c(1,2,3,4,5), n_people=c(59,89,30,23,4))
i.e. a count of how many participants have completed a certain number of assessments at this point in time.
Although it is useful to know how many people have completed exactly n sessions, we more often need to know how many have completed at least n sessions. As per the table below, a standard cumulative sum isn't appropriate. What we want are the values in the n_total column, which is a sort of "forward cumulative sum" of the values in the n_people column, i.e. the value in each row should be the sum of that row's value and all values beyond it, rather than the standard cumulative sum, which is the sum of all values up to and including itself:
n_sessions n_people n_total cumsum
1 59 205 59
2 89 146 148
3 30 57 178
4 23 27 201
5 4 4 205
Generating the cumulative sum is simple:
mutate(df, cumsum = cumsum(n_people))
What would be an expression for generating a "forward cumulative sum" that could be incorporated in a dplyr analysis chain? I'm guessing that cumsum would need to be applied to n_people after sorting by n_sessions descending, but I can't quite get my head around how to get the answer while preserving the original order of the data frame.
You can take a cumulative sum of the reversed vector, then reverse that result. The built-in rev function is helpful here:
mutate(df, rev_cumsum = rev(cumsum(rev(n_people))))
For example, on your data this returns:
n_sessions n_people rev_cumsum
1 1 59 205
2 2 89 146
3 3 30 57
4 4 23 27
5 5 4 4
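An equivalent formulation (a sketch) avoids the double rev() by subtracting the ordinary cumulative sum from the total and adding each row's own value back, so every row keeps itself plus everything after it:
library(dplyr)
mutate(df, n_total = sum(n_people) - cumsum(n_people) + n_people)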
