Using multiple weights for lm() or glm() in R

I want to fit a model in R in which I need to apply two weights at the same time. Let's say my model is glm(y ~ x1 + male + East_Germany), where male identifies a respondent's gender and East_Germany is another binary variable indicating whether someone lives in East Germany.
Now let's say both females and East Germans are dramatically misrepresented in my data. Assuming that this is not due to a flawed data collection process, I would have to apply two weights. But can I really specify two weights simultaneously like this...? glm(y ~ x1 + male + East_Germany, weights = c("male_wgt", "east_wgt"))
My thinking was that applying one weight to the overall data or the overall model already changes the whole data structure, but I might be wrong. Here is an example:
y  x1  male  east  male_weight  east_weight
5   4     0     1            5            2
3   2     1     1            1            2
9   7     1     0            1            1
4   8     1     0            1            1
1   3     1     0            1            1
6   4     1     0            1            1
...where male==1 means "male" and male==0 means "female", and we assume that both genders should be represented equally (50%); east==1 means "East Germany" and east==0 means "West Germany", and here too, for simplicity, let's assume both should be represented equally. y and x1 are just random numbers.
I'm wondering how I would apply both weights simultaneously, because if I say "let's count row #1 five times so that females gain more weight", I'm at the same time giving East Germany too much weight (without having even applied east_weight). The reason is that if we count row #1 five times, we end up with a new East/West ratio of 6:4. Or am I mistaken?
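The only workarounds I can think of (sketches, I'm not sure either is correct) are to build a single combined weight per row, e.g. by multiplying the two factors when the gender and East/West imbalances can be treated as independent, or to rake on both margins with the survey package (the target Freq counts below are made-up numbers):
# Sketch 1: one combined weight per row (only sensible if the two
# adjustments are independent of each other)
df$wgt <- df$male_weight * df$east_weight
fit <- glm(y ~ x1 + male + east, data = df, weights = wgt)

# Sketch 2: rake the design to the desired gender and East/West margins
library(survey)
des <- svydesign(ids = ~1, data = df, weights = ~1)
raked <- rake(des,
              sample.margins     = list(~male, ~east),
              population.margins = list(data.frame(male = 0:1, Freq = c(500, 500)),
                                        data.frame(east = 0:1, Freq = c(500, 500))))
fit2 <- svyglm(y ~ x1 + male + east, design = raked)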

Related

GLMM of proportions adjusting for group size

I'm trying to investigate whether the proportion of muzzle contact (mc) in primates tends to be directed more towards the mother than towards other group members (adults or juveniles). I have data over 5 years in 4 different groups. This is an example for 3 different initiators (those initiating the mc):
age1data

initiator  receiver  count  total_init  prop_mc  subgroupsize  group
Aaa        Mother        1           3    0.333             1      1
Aaa        Adult         2           3    0.666            40      1
Aaa        Juvenile      0           3    0                20      1
Hee        Mother        0           2    0                 1      1
Hee        Adult         0           2    0                40      1
Hee        Juvenile      2           2    1                20      1
Awa        Mother        2          10    0.2               1      2
Awa        Adult         3          10    0.3               7      2
Awa        Juvenile      5          10    0.5              13      2
count: number of mc directed to an individual belonging to each receiver subgroup
total_init: total number of mc by this individual
subgroupsize: number of individuals within the group that belong to the receiver subgroup (for example, each individual has 1 mother, but group 1 has 40 adults (other than the mother) and 20 juveniles)
This is the model I tried:
glmm_ages <- glmer(((count / total_init) / subgroupsize) ~ receiver +
                     (1 | group) + (1 | initiator),
                   data = age1data,
                   family = binomial)
This gives me this error message:
Error in pwrssUpdate(pp, resp, tol = tolPwrss, GQmat = GQmat, compDev = compDev, :
Downdated VtV is not positive definite
In addition: Warning message:
In eval(family$initialize, rho) : non-integer #successes in a binomial glm!
The model works when I do a simple GLM without group and initiator as random variables but I really think I need to include them.
From what I understand, the error message means that some categories are all 1 or all 0, which is the case when an individual is only recorded muzzle contacting its mother once, for example (the dependent variable becomes 1/1/1 = 1).
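I also wonder whether the warning about non-integer successes means I should be modelling the counts directly instead of a pre-divided proportion, e.g. something like the sketch below (I'm not sure this is right, and adjusting for subgroup size via log(subgroupsize) is just my guess):
library(lme4)
# Sketch: two-column (successes, failures) response instead of a proportion;
# subgroup size enters as a covariate rather than as a divisor of the response
glmm_ages2 <- glmer(cbind(count, total_init - count) ~ receiver + log(subgroupsize) +
                      (1 | group) + (1 | initiator),
                    data = age1data,
                    family = binomial)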
I'm trying to understand what I should do from this thread I found http://bbolker.github.io/mixedmodels-misc/ecostats_chap.html#digression-complete-separation
In this section, I'm not sure how to find the number I should use instead of the "10":
newdat <- subset(age1data,
                 abs(resid(glmm_ages, "pearson")) < 10)
I'm also not sure what all of this means, or how I can figure out the appropriate variance and standard deviation for my own dataset:
impose zero-mean Normal priors on the fixed effects (a 4 × 4 diagonal matrix with diagonal elements equal to 9, for variances of 9 or standard deviations of 3)
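If I understand that sentence, it translates into something like the following blme call (a rough sketch, I'm not sure it's right; diag(9, 3) because my formula would have three fixed-effect coefficients, the intercept plus two receiver contrasts, whereas the quoted model had four):
library(blme)
# Sketch: zero-mean Normal priors (variance 9, i.e. sd 3) on the fixed effects
# to keep the estimates finite under complete separation
bglmm_ages <- bglmer(cbind(count, total_init - count) ~ receiver +
                       (1 | group) + (1 | initiator),
                     data = age1data,
                     family = binomial,
                     fixef.prior = normal(cov = diag(9, 3)))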
Can anyone help me figure out if I'm doing the right thing and this is the solution for me?
I apologize for the length of this post; I wanted to make sure everything was there. I hope it's clear!

How can I specify the carryover in the first period of a two-treatment three-period crossover study (ABB/BAA)?

I have found a lot of information on how to analyze a 2×2 (AB/BA) crossover trial; however, there is less material on how to disentangle the carryover effect when the study is designed with three periods and two sequences (ABB/BAA). It is worth mentioning that A and B are the treatments and there were wash-out phases between the three periods.
As sample data, I would like to use the bioequivalence data from the daewr package.
library("daewr")
data(bioequiv)
head(bioequiv)
Group Subject Period Treat Carry y
1 1 2 1 A none 112.25
2 1 2 2 B A 106.36
3 1 2 3 B B 88.59
4 1 3 1 A none 153.71
5 1 3 2 B A 150.13
6 1 3 3 B B 151.31
The variable Carry contains lagged information from the previous period's Treatment.
The model below should be able to disentangle the effects, but I don't know how to replace the none in the Carry column. I am not sure how to specify this, or how to check if the carryover effect is negligible.
If we don't replace the none values in the Carry column, the model below suffers from multicollinearity.
library(lme4)

fit <- lmer(y ~ Period + Treat + Carry + (1 | Subject), data = bioequiv)
anova(fit)
summary(fit)
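One thing I considered (just a sketch; I don't know whether it actually resolves the aliasing in an ABB/BAA design) is to keep the none entries but make Carry an explicit factor with none as the reference level, so that first-period rows contribute no carryover contrast:
# Sketch: treat "none" as the reference level of Carry
bioequiv$Carry <- relevel(factor(bioequiv$Carry), ref = "none")
fit <- lmer(y ~ Period + Treat + Carry + (1 | Subject), data = bioequiv)
anova(fit)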

How to calculate percentages for categorical variables by items?

I have a question about calculating the percentage by items and time bins. The experiment is like this:
I conducted an eye-tracking experiment. Participants were asked to describe pictures consisting of two areas of interest (AOIs; I call them Agent and Patient). Their eye movements (fixations on the two AOIs) were recorded over time while they planned their formulation. I produced a dataset containing the time information and AOIs as below (the whole time from picture onset was divided into separate time bins of 40 ms each).
Stimulus Participant AOIs time_bin
1 M1 agent 1
1 M1 patient 2
1 M1 patient 3
1 M1 agent 4
...
1 M2 agent 1
1 M2 agent 2
1 M2 agent 3
1 M2 patient 4
...
1 M3 agent 1
1 M3 agent 2
1 M3 agent 3
1 M3 patient 4
...
2 M1 agent 1
2 M1 agent 2
2 M1 patient 3
2 M1 patient 4
I would like to create a table containing the proportion of one AOI (e.g. agent) for each stimulus in each time bin. It would look like this:
Stimulus time_bin percentage
1 1 20%
1 2 40%
1 3 55%
1 4 60%
...
2 1 30%
2 2 35%
2 3 40%
2 4 45%
I calculate the percentage because I want to do a multilevel analysis (Growth Curve Analysis) investigating the relationship between the dependent variable (agent fixation proportion) and the independent variable (time_bin), with stimulus as a random effect.
I hope my question is understandable despite my limited English.
If you have an idea or a suggestion, that would be a great help!
Using the tidyverse package ecosystem you could try:
library(tidyverse)
df %>%
  mutate(percentage = as.integer(AOIs == "agent")) %>%
  group_by(Stimulus, time_bin) %>%
  summarise(percentage = mean(percentage))
Note that this will give you ratios in the [0, 1] interval. You still have to convert them to percentage values by multiplying by 100 and appending "%".
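For example, that final conversion could be appended to the same pipeline (just one option; scales::percent() would also work):
df %>%
  mutate(percentage = as.integer(AOIs == "agent")) %>%
  group_by(Stimulus, time_bin) %>%
  summarise(percentage = mean(percentage)) %>%
  mutate(percentage = paste0(round(100 * percentage), "%"))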

How can loading factors from PCA be used to calculate an index that can be applied for each individual in a data frame in R?

I am using principal component analysis (PCA) based on ~30 variables to compose an index that classifies individuals in 3 different categories (top, middle, bottom) in R.
I have a dataframe of ~2000 individuals with 28 binary and 2 continuous variables.
Now, I would like to use the loading factors from PC1 to construct an index that classifies my 2000 individuals on these 30 variables into 3 different groups.
Problem: Despite extensive research, I could not find out how to extract the loading factors from PCA_loadings and give each individual a score (based on the loadings of the 30 variables), which would subsequently allow me to rank each individual (for further classification). Does it make sense to display the loading factors in a graph?
I've performed the following steps:
a) Ran a PCA using PCA_outcome <- prcomp(na.omit(df1), scale = T)
b) Extracted the loadings using PCA_loadings <- PCA_outcome$rotation
c) Removed all the variables for which the loading factors were close to 0.
I have considered creating 30 new variables, one for each loading factor, and summing them up for each binary variable == 1 (though I am not sure how to proceed with the continuous variables). Consequently, I would assign each individual a score. However, I do not know how to combine the 30 loading-factor values into a single score for each individual.
R code
df1 <- read.table(text="
educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header=T)
## Run PCA
PCA_outcome <- prcomp(na.omit(df1), scale = T)
## Extract loadings
PCA_loadings <- PCA_outcome$rotation
## Explanation: A-E are 5 of the 2000 individuals and the variables (education, call, house, school, members) represent my 30 variables (binary and continuous).
Expected results:
- Get a rank score for each individual
- Subsequently, assign a category 1-3 to each individual.
I'm not 100% sure what you're asking, but here's an answer to the question I think you're asking.
First of all, PC1 of a PCA won't necessarily provide you with an index of socio-economic status. As explained here, PC1 simply "accounts for as much of the variability in the data as possible". PC1 may well work as a good metric for socio-economic status for your data set, but you'll have to critically examine the loadings and see if this makes sense. Depending on the signs of the loadings, it could be that a very negative PC1 corresponds to a very positive socio-economic status. As I say: look at the results with a critical eye. An explanation of how PC scores are calculated can be found here. Anyway, that's a discussion that belongs on Cross Validated, so let's get to the code.
It sounds like you want to perform the PCA, pull out PC1, and associate it with your original data frame (and merge_ids). If that's your goal, here's a solution.
# Create data frame
df <- read.table(text = "educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header = TRUE)
# Perform PCA
PCA <- prcomp(df[, names(df) != "merge_id"], scale = TRUE, center = TRUE)
# Add PC1
df$PC1 <- PCA$x[, 1]
# Look at new data frame
print(df)
#> educ call house merge_id school members PC1
#> A 1 0 1 12_3 0 0.9 0.1000145
#> B 0 0 0 13_3 1 0.8 1.6610864
#> C 1 1 1 14_3 0 1.1 -0.8882381
#> D 0 0 0 15_3 1 0.8 1.6610864
#> E 1 1 1 16_3 3 3.2 -2.5339491
Created on 2019-05-30 by the reprex package (v0.2.1.9000)
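If you also need the ranking and the three categories, here is a rough sketch building on the data frame above (whether a high PC1 should count as "top" depends on the signs of the loadings, so check those first):
# Sketch: rank individuals by PC1 and cut them into three equal-sized groups
df$rank  <- rank(df$PC1)
df$group <- cut(df$PC1,
                breaks = quantile(df$PC1, probs = c(0, 1/3, 2/3, 1)),
                labels = c("bottom", "middle", "top"),
                include.lowest = TRUE)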
As you say you have to use PCA, I'm assuming this is for a homework question, so I'd recommend reading up on PCA so that you get a feel of what it does and what it's useful for.

Check if a variable is time invariant in R

I tried to search for an answer to my question, but I only found the right answer for Stata (I am using R).
I am using a national survey to study which variables influence the investment in complementary pension (it is voluntary in my country).
The survey is conducted every two years and some individuals are interviewed more than once. I filtered the df with the filter command so that it contains only the individuals present more than once. This is an example from the original survey, already filtered:
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 1
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
2008 4 1972 F 33000 0
2010 4 1972 F 35000 0
where id is the individual, y.b is year of birth, pens is a dummy which takes value 1 if the individual invests in a complementary pension form.
I wanted to run a FE regression so I load the plm package and then I set the df like this:
df.p <- plm.data(df, c("id", "year"))
After this command, I expected constant variables to be dropped, but after running this regression:
pan1 <- plm(pens ~ woman + age + I(age^2) + high + medium + north + centre, model = "within", effect = "individual", data = df.p, na.action = na.omit)
(where woman is a variable which takes value 1 if the individual is a woman, high and medium refer to education level, and north and centre to geographical regions) and after the command summary(pan1), the variable woman is still present.
At this point I think that there are some mistakes in the survey (for example, sex was not entered correctly and so it isn't the same for the same id), so I tried to find a way to check whether sex is constant for each id.
I tried this code but I am sure it is not correct:
df$x <- ifelse(df$id == df$id & df$sex == df$sex, 1, 0)
The basic idea should be something like this:
df$x <- ifelse(df$id == "1" & df$sex == "F", 1, 0)
but I can't do it manually since the df has up to 40k observations.
If you know another way to check whether a variable is constant in R, I would be glad to hear it.
Thank you in advance.
I think what you are trying to do is calculate the number of unique values of sex for each id. You are hoping it is 1, but any cases of 2 indicate a transcription error. The way to do this in R is
any(by(df$sex, df$id, function(x) length(unique(x))) > 1)
To break that down, the function length(unique(x)) tells you the number of different unique values in a vector. It's similar to levels for a factor (but not identical, since a factor can have levels not present).
The function by calculates the given function on each subset of df$sex according to df$id. In other words, it calculates length(unique(df$sex)) where df$id is 1, then 2, etc.
Lastly, any(... > 1) checks if any of the results are more than one. If they are, the result will be TRUE (and you can use which instead of any to find which ones). If everything is okay, the result will be FALSE.
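For example, a small variant using tapply (a close relative of by) makes it easy to list the offending ids by name:
# Sketch: number of distinct sex values per id, then the ids with more than one
n_sex <- tapply(df$sex, df$id, function(x) length(unique(x)))
which(n_sex > 1)  # the names of this result are the problematic ids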
We can try with dplyr
Example data:
df <- data.frame(year = c(2002, 2002, 2004, 2004, 2006, 2008, 2008, 2010),
                 id   = c(1, 2, 1, 2, 3, 3, 4, 4),
                 sex  = c("F", "M", "M", "M", "M", "M", "F", "F"))
Id 1 is both F and M
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(sexes = length(unique(sex)))
# A tibble: 4 x 2
id sexes
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 4 1
We can then filter:
df %>%
  group_by(id) %>%
  summarise(sexes = length(unique(sex))) %>%
  filter(sexes == 2)
# A tibble: 1 x 2
id sexes
<dbl> <int>
1 1 2
