How can loading factors from PCA be used to calculate an index that can be applied for each individual in a data frame in R?

I am using principal component analysis (PCA) based on ~30 variables to compose an index that classifies individuals in 3 different categories (top, middle, bottom) in R.
I have a dataframe of ~2000 individuals with 28 binary and 2 continuous variables.
Now, I would like to use the loading factors from PC1 to construct an index that classifies my 2000 individuals on these 30 variables into 3 different groups.
Problem: Despite extensive research, I could not find out how to extract the loading factors from PCA_loadings, give each individual a score (based on the loadings of the 30 variables), which would subsequently allow me to rank each individual (for further classification). Does it make sense to display the loading factors in a graph?
I've performed the following steps:
a) Ran a PCA using PCA_outcome <- prcomp(na.omit(df1), scale = T)
b) Extracted the loadings using PCA_loadings <- PCA_outcome$rotation
c) Removed all the variables for which the loading factors were close to 0.
I have considered creating 30 new variables, one for each loading factor, which I would sum up for each binary variable == 1 (though I am not sure how to proceed with the continuous variables). That way, I would assign each individual a score. However, I do not know how to combine the 30 loading values into a single score for each individual.
R code
df1 <- read.table(text="
educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header=T)
## Run PCA
PCA_outcome <- prcomp(na.omit(df1), scale = T)
## Extract loadings
PCA_loadings <- PCA_outcome$rotation
## Explanation: A-E are 5 of the 2000 individuals and the variables (education, call, house, school, members) represent my 30 variables (binary and continuous).
Expected results:
- Get a rank score for each individual
- Subsequently, assign a category 1-3 to each individual.

I'm not 100% sure what you're asking, but here's an answer to the question I think you're asking.
First of all, PC1 of a PCA won't necessarily provide you with an index of socio-economic status. As explained here, PC1 simply "accounts for as much of the variability in the data as possible". PC1 may well work as a good metric for socio-economic status for your data set, but you'll have to critically examine the loadings and see if this makes sense. Depending on the signs of the loadings, it could be that a very negative PC1 corresponds to a very positive socio-economic status. As I say: look at the results with a critical eye. An explanation of how PC scores are calculated can be found here. Anyway, that's a discussion that belongs on Cross Validated, so let's get to the code.
It sounds like you want to perform the PCA, pull out PC1, and associate it with your original data frame (and merge_ids). If that's your goal, here's a solution.
# Create data frame
df <- read.table(text = "educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header = TRUE)
# Perform PCA
PCA <- prcomp(df[, names(df) != "merge_id"], scale. = TRUE, center = TRUE)
# Add PC1
df$PC1 <- PCA$x[, 1]
# Look at new data frame
print(df)
#> educ call house merge_id school members PC1
#> A 1 0 1 12_3 0 0.9 0.1000145
#> B 0 0 0 13_3 1 0.8 1.6610864
#> C 1 1 1 14_3 0 1.1 -0.8882381
#> D 0 0 0 15_3 1 0.8 1.6610864
#> E 1 1 1 16_3 3 3.2 -2.5339491
Created on 2019-05-30 by the reprex package (v0.2.1.9000)
As you say you have to use PCA, I'm assuming this is for a homework question, so I'd recommend reading up on PCA so that you get a feel of what it does and what it's useful for.
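To connect this back to the loadings the question asks about, here is a minimal sketch in base R. It shows that a PC1 score is just the standardized variables weighted by the PC1 loadings, and then ranks and groups the individuals. The tercile split, the labels "bottom"/"middle"/"top", and the names pc1_manual, rank, and group are assumptions for illustration. Because the sign of PC1 is arbitrary, the labels only make sense after checking the loadings.

```r
# Toy data from the question; the first column (A-E) becomes the row names
df <- read.table(text = "educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header = TRUE)

vars <- df[, names(df) != "merge_id"]          # numeric variables only
pca  <- prcomp(vars, center = TRUE, scale. = TRUE)

# A PC1 score is the standardized data weighted by the PC1 loadings
pc1_manual <- as.vector(scale(vars) %*% pca$rotation[, 1])
stopifnot(isTRUE(all.equal(pc1_manual, unname(pca$x[, 1]))))

df$PC1  <- pc1_manual
df$rank <- rank(-df$PC1, ties.method = "min")  # 1 = highest PC1 score

# Tercile cut points; assumes they are distinct (heavy ties can break this)
cuts <- quantile(df$PC1, probs = c(0, 1/3, 2/3, 1))
df$group <- cut(df$PC1, breaks = cuts,
                labels = c("bottom", "middle", "top"),
                include.lowest = TRUE)
print(df[, c("merge_id", "PC1", "rank", "group")])
```

With many tied scores the quantile cut points can coincide, in which case cut() fails and the breaks would need to be chosen by hand.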

Related

How can I specify the carryover in the first period of a two-treatment three-period crossover study (ABB/BAA)?

I have found a lot of information on how to analyze a 2*2 (AB/BA) crossover trial; however, there are fewer materials on how to disentangle the carryover effect when the study is designed in three periods and two sequences (ABB/BAA). It is worth mentioning that A and B are the treatments and there have been wash-out phases between the three periods.
As sample data, I would like to use the bioequivalence data from "daewr" library.
library("daewr")
data(bioequiv)
head(bioequiv)
  Group Subject Period Treat Carry      y
1     1       2      1     A  none 112.25
2     1       2      2     B     A 106.36
3     1       2      3     B     B  88.59
4     1       3      1     A  none 153.71
5     1       3      2     B     A 150.13
6     1       3      3     B     B 151.31
The variable Carry contains lagged information from the previous period's Treatment.
The model below should be able to disentangle the effects, but I don't know what to put in place of none in the Carry column, or how to check whether the carryover effect is negligible.
If none is left as it is in the Carry column, the model below runs into multicollinearity.
library(lme4)  # provides lmer()
fit <- lmer(y ~ Period + Treat + Carry + (1 | Subject), data = bioequiv)
anova(fit)
summary(fit)

How to calculate percentages for categorical variables by items?

I have a question about calculating the percentage by items and time bins. The experiment is like this:
I conducted an eye-tracking experiment. Participants were asked to describe pictures consisting of two areas of interest (AOIs; I call them Agent and Patient). Their eye movements (fixations on the two AOIs) were recorded over time while they planned their formulation. I built a dataset that includes time information and AOIs, as shown below (the whole time from picture onset was divided into separate time bins of 40 ms each).
Stimulus Participant AOIs time_bin
1 M1 agent 1
1 M1 patient 2
1 M1 patient 3
1 M1 agent 4
...
1 M2 agent 1
1 M2 agent 2
1 M2 agent 3
1 M2 patient 4
...
1 M3 agent 1
1 M3 agent 2
1 M3 agent 3
1 M3 patient 4
...
2 M1 agent 1
2 M1 agent 2
2 M1 patient 3
2 M1 patient 4
I would like to create a table containing the proportion of one AOI (e.g. agent) by each stimulus of each time bin. It would be like this:
Stimulus time_bin percentage
1 1 20%
1 2 40%
1 3 55%
1 4 60%
...
2 1 30%
2 2 35%
2 3 40%
2 4 45%
I calculate the percentage because I want to do a multilevel analysis (Growth Curve Analysis) investigating the relationship between the dependent variable agent fixation proportion and the independent variable time_bin, as well as with the stimulus as a random effect.
I hope my question is understandable despite my limited English.
If you have an idea or a suggestion, that would be a great help!
Using the tidyverse package ecosystem you could try:
library(tidyverse)
df %>%
  mutate(percentage = as.integer(AOIs == "agent")) %>%
  group_by(Stimulus, time_bin) %>%
  summarise(percentage = mean(percentage))
Note that this will give you ratios in the [0, 1] interval. You still have to convert them to percentage values by multiplying by 100 and appending "%".
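For a version that runs without extra packages, here is a minimal base R sketch of the same idea, including the conversion to percentage strings. The toy data frame fix and the column names agent and percentage are made up for illustration; only the column layout follows the question.

```r
# Hypothetical toy data in the layout described in the question
fix <- data.frame(
  Stimulus    = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  Participant = c("M1", "M1", "M1", "M1", "M2", "M2", "M2", "M2",
                  "M1", "M1", "M1", "M1"),
  AOIs        = c("agent", "patient", "patient", "agent",
                  "agent", "agent", "agent", "patient",
                  "agent", "agent", "patient", "patient"),
  time_bin    = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
)

# 1 if the fixation is on the agent, 0 otherwise
fix$agent <- as.integer(fix$AOIs == "agent")

# Mean of the 0/1 indicator = proportion of agent fixations per cell
prop <- aggregate(agent ~ Stimulus + time_bin, data = fix, FUN = mean)
prop <- prop[order(prop$Stimulus, prop$time_bin), ]

# Formatted label for display; keep the numeric column for modelling
prop$percentage <- sprintf("%.0f%%", 100 * prop$agent)
print(prop)
```

For the growth curve analysis itself, the numeric proportion in agent is probably the more useful column, since the formatted percentage is a character string.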

Using multiple weights for lm() or glm() in R

I want to fit a model in R within which I need to apply two weights at the same time. Let's say my model is glm(y ~ x1 + male + East_Germany) where male identifies a respondent's gender and East Germany is also a binary variable checking whether someone lives in East Germany.
Now let's say both females and East Germans are dramatically misrepresented in my data. Assuming that it's not due to a flawed data collection process, I would have to apply two weights. But can I really specify two weights simultaneously like this...? glm(y ~ x1 + male + East_Germany, weight=c("male_wgt","east_wgt"))
My thinking was that applying one weight to the overall data or the overall model already changes up the whole data structure, but I might be wrong. Let's give you an example:
y x1 male east male_weight east_weight
5 4 0 1 5 2
3 2 1 1 1 2
9 7 1 0 1 1
4 8 1 0 1 1
1 3 1 0 1 1
6 4 1 0 1 1
...where male==1 means "male" and male==0 is "female" and we assume that both genders should be represented equally (50%), and where east==1 means "East Germany", east==0 is "West Germany" and here too, let's assume for reasons of simplicity that both should be represented equally. y and x1 are just random numbers.
I'm wondering how I would apply both weights simultaneously, because if I say "let's count row#1 five times so that females gain more weight", I'm at the same time giving East Germany too much weight (without having even applied east_weight). The reason is that if we count row#1 five times, we will end up with a new East/West ratio of 6:4. Or am I mistaken?
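The arithmetic behind that 6:4 worry can be checked directly. The sketch below uses the six example rows and computes the weighted share of East Germans under different weight vectors. The helper east_share and the combined column w (a naive product of the two weights) are assumptions for illustration, not a recommended weighting scheme. The one thing the sketch does rely on is that the weights argument of glm() expects a single numeric vector with one value per row, not a vector of column names.

```r
# Example rows from the question
d <- data.frame(
  y           = c(5, 3, 9, 4, 1, 6),
  x1          = c(4, 2, 7, 8, 3, 4),
  male        = c(0, 1, 1, 1, 1, 1),
  east        = c(1, 1, 0, 0, 0, 0),
  male_weight = c(5, 1, 1, 1, 1, 1),
  east_weight = c(2, 2, 1, 1, 1, 1)
)

# Weighted share of East Germans under a given weight vector
east_share <- function(w) sum(w[d$east == 1]) / sum(w)

east_share(rep(1, nrow(d)))   # unweighted: 2 of 6 rows
east_share(d$male_weight)     # gender weight alone: 6 of 10, the 6:4 ratio

# Assumption: naive product of both weights as one combined weight
d$w <- d$male_weight * d$east_weight
east_share(d$w)               # 12 of 16, so the East margin overshoots

# `weights` takes a single numeric vector, one value per row
fit <- glm(y ~ x1 + male + east, data = d, weights = w)
```

So the concern in the question holds for these rows: neither the gender weight alone nor the naive product reproduces a 50/50 East/West split. Weights that match several margins at once are often built by raking (iterative proportional fitting), for example with rake() from the survey package, and then passed to the model as one vector.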

Probability of account win/loss using Bayesian Statistics

I am trying to estimate the probability of winning or losing an account, and I'd like to do this using Bayesian Methods. I'm not really that familiar with these methods, but I think I understand the general idea.
I know some information about losses and wins. Wins are usually characterized by some combination of activities; losses are usually characterized by a different combination of activities. I'd like to be able to get some posterior probability of whether or not a new observation will be won or lost based on the current number of activities that are associated with that account.
Here is an example of my data: (This is just a sample for simplicity)
Email Call Callback Outcome
14 9 2 1
3 2 4 0
16 14 2 0
15 1 3 1
5 2 2 0
1 1 0 0
10 3 5 0
2 0 1 0
17 8 4 1
3 15 2 0
17 1 3 0
10 7 5 0
10 2 3 0
8 0 0 1
14 10 3 0
1 9 3 1
5 10 3 1
13 5 1 0
9 4 4 0
So from here I know that 30% of the observations have an outcome of 1 (win) and 70% have an outcome of 0 (loss). Let's say that I want to use the other columns to get a probability of win/loss for a new observation which may have a small number of events (emails, calls, and callbacks) associated with it.
Now let's say that I want to use the counts/proportions of the different events as priors for a new observation. This is where I start getting tripped up. My thinking is to create a Dirichlet distribution for each outcome, so two separate distributions, one for wins and one for losses, using the counts/proportions of events for each outcome as the priors. I guess I'm not sure how to do this in R. I think my course of action would be to estimate a Dirichlet distribution (since I have 3 variables) for each outcome using maximum likelihood. I've been trying to use the dirichlet.simul and dirichlet.mle functions from the sirt package in R. I'm not sure if I need to simulate one first?
Another issue is that once I have this distribution, it's unclear to me how to get a posterior distribution for a new observation. I've read several papers and can't seem to find a straightforward process for doing this. (Or maybe there are some holes in my understanding.) Any pushes in the right direction would be greatly appreciated.
This is the code I've tried so far:
### FOR WON ACCOUNTS
library(sirt)  # provides dirichlet.simul() and dirichlet.mle()
set.seed(789)
N <- 6
probs <- c(0.535714286, 0.330357143, 0.133928571 )
alpha <- probs
alpha <- matrix( alpha , nrow=N , ncol=length(alpha) , byrow=TRUE )
x <- dirichlet.simul( alpha )
dirichlet.mle(x)
$alpha
[1] 0.3385607 0.2617939 0.1972898
$alpha0
[1] 0.7976444
$xsi
[1] 0.4244507 0.3282088 0.2473405
### FOR LOST ACCOUNTS
set.seed(789)
N2 <- 14
probs2 <- c(0.528037383,0.308411215,0.163551402 )
alpha2 <- probs2
alpha2 <- matrix( alpha2 , nrow=N , ncol=length(alpha2) , byrow=TRUE )
x2 <- dirichlet.simul( alpha2 )
dirichlet.mle(x2)
$alpha
[1] 0.3388486 0.2488771 0.2358043
$alpha0
[1] 0.8235301
$xsi
[1] 0.4114587 0.3022077 0.2863336
Not sure if this is a correct approach or how to get posteriors from here. I realize all the outputs look similar across won/lost accounts. I just used some simulated data to represent what I'm working with.
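As a possible starting point for the posterior step, here is a minimal sketch of Bayes' rule with a multinomial likelihood. It is deliberately simpler than the Dirichlet approach described above: it plugs in the event proportions (probs, probs2) and the 30%/70% class shares from the question as fixed values, and it ignores the uncertainty in those proportions. The new account new_obs and the names p_win, p_loss, prior, and posterior are hypothetical.

```r
# Plug-in event proportions and class shares taken from the question
p_win  <- c(email = 0.535714286, call = 0.330357143, callback = 0.133928571)
p_loss <- c(email = 0.528037383, call = 0.308411215, callback = 0.163551402)
prior  <- c(win = 0.3, loss = 0.7)

# Hypothetical new account: 3 emails, 1 call, 0 callbacks so far
new_obs <- c(email = 3, call = 1, callback = 0)

# Likelihood of the counts under each outcome (multinomial, given the total)
lik <- c(win  = dmultinom(new_obs, prob = p_win),
         loss = dmultinom(new_obs, prob = p_loss))

# Bayes' rule: posterior is proportional to prior times likelihood
posterior <- prior * lik / sum(prior * lik)
print(round(posterior, 3))
```

Because the event proportions for won and lost accounts are nearly identical in this example, the posterior stays close to the prior; the activity mix carries little information here.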

drawing multiple boxplots from imputed data in R

I have an imputed dataset that I'm analysing, and I'm trying to draw boxplots, but I can't wrap my head around the proper procedure.
my data (a sample, original has 20 observations per imputation and 13 vars per group, all values range from 0 to 25):
.imp .id FTE_RM FTE_PD OMZ_RM OMZ_PD
1 1 25 25 24 24
1 2 4 0 2 6
1 3 11 5 3 2
1 4 12 3 3 3
2 1 20 15 15 15
2 2 4 1 2 3
2 3 0 0 0 6
2 4 20 0 0 0
.imp signifies the imputation round, .id the identifier for each observation.
I want to draw all the FTE_* variables in a single plot (and the OMZ_* variables in another), but I wonder what to do with all the imputations: can I just include all values? The imputed data now has 500 observations. With an ANOVA, for instance, I'd need to average the ANOVA results by 5 to get back to 20 observations. But is this needed for a boxplot as well, since I only deal with medians, means, maxima, and minima?
Such as:
library(reshape2)  # provides melt()
library(ggplot2)
data_melt <- melt(df[grep("^FTE_", colnames(df))])
ggplot(data_melt, aes(x = variable, y = value)) + geom_boxplot()
I've played a couple of times with ggplot, but consider myself a complete newbie.
I assume you want to keep the identifiers .imp and .id after melting, so use this instead:
data_melt <- melt(df,c(".imp",".id"))
For completeness of the dataframe it probably helps to introduce a column that identifies the type - FTE vs. OMZ:
data_melt$type <- ifelse(grepl("FTE",data_melt$variable),"FTE","OMZ")
Having this data.frame you can, for example, facet on the type (alternatively you can just use a simple filter statement on data_melt to restrict to one type):
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()+facet_wrap(~type,scales="free_x")
This produces one panel of boxplots per type (FTE and OMZ).
EDIT: fixed the data mess-up
