Performing PCA on opinion variables to create a left-right placement variable: trouble with interpretation

I work on 4 different countries represented in a database of N=2500: France, Germany, Austria and the Netherlands. For each country I have opinion variables on political candidates. These variables, named scale_"candidate's name", are coded as follows: a value of 1 corresponds to a very negative opinion and 10 to a very positive opinion.
The individuals in my database are not all from the same country, so for each of these variables the missing values include both individuals from other countries and individuals from the country in question who did not answer the question about the candidate.
I would like to create a left-right placement variable for all these individuals from my opinion variables. The ultimate goal is to add this new left-right placement variable to my general dataset D.
I'm trying to do a PCA, but I'm running into interpretation difficulties and I'm a bit stuck.
For now I figured it would be easier to work country by country, so I divided my dataset into 4.
# Dividing the dataset in 4
french_data <- subset(D, cntry == "France")
german_data <- subset(D, cntry == "Germany")
aust_data   <- subset(D, cntry == "Austria")
dutch_data  <- subset(D, cntry == "Netherlands")
# I will take my French sample as an example:
# Isolate the opinion variables and deal with NAs
library(dplyr)
library(tidyr)
french_data <- french_data[, c("scale_Faure", "scale_Macron", "scale_ Le Pen",
                               "scale_Jacob", "scale_JLM", "scale_ Jadot")]
# Replace NAs with the column mean
french_data <- french_data %>%
  mutate_if(is.numeric, ~ replace_na(., mean(., na.rm = TRUE)))
View(french_data)
# Perform PCA
acpfr<-prcomp(french_data, center = TRUE, scale. = TRUE)
# PCA coefficients
print(acpfr$rotation)
                     PC1        PC2         PC3        PC4        PC5         PC6
scale_Faure   0.53843404 -0.2250947  0.23983503  0.2561127 -0.3028698  0.66676571
scale_Macron  0.21404458  0.4849783 -0.66993526  0.3172137 -0.4050550 -0.07398445
scale_ Le Pen 0.05153401  0.6750354  0.55336030 -0.3427580 -0.3415634 -0.03626538
scale_Jacob   0.45431307  0.4062040  0.04222295  0.1604051  0.7739880  0.04503232
scale_JLM     0.39125881 -0.1228762 -0.37745970 -0.8235126  0.0200508  0.10376513
scale_ Jadot  0.54966729 -0.2799203  0.20784534  0.1113373 -0.1675569 -0.73201058
First, I am left with these numbers, which I have trouble interpreting.
Second, I am somewhat lost as to what the next steps are, considering my goal. What should I do next?
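In case it helps to see where I want to end up, here is a minimal sketch of the step I have in mind once I know which component to use (assuming PC1 turns out to be the left-right axis; left_right is just a placeholder name I made up):
# attach the individual scores on the chosen component back to the French subsample
french_data$left_right <- acpfr$x[, 1]
# optionally rescale to 0-10 so it reads like the original opinion scales
rng <- range(french_data$left_right)
french_data$left_right <- 10 * (french_data$left_right - rng[1]) / (rng[2] - rng[1])
I would then repeat this for the other three countries and merge the variable back into D, which is exactly where I am unsure whether (and how) PC1 can be read as a left-right placement.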
Thanks in advance for your help.

Related

PCA - All variables with the same sign on PC1 coordinates

So, I am analyzing a dataset that consists of 160 observations and 20 variables and am performing a PCA. It is about patients affected by a disease; the variables are antibody levels measured in the same experiment, all in the same units (u/mL). These variables take only positive values, so I can't understand how I would have samples on the positive PC1 side of the plot without any variable contributing to that side (given that there are no negative values involved in these variables).
As for confounding factors, what I have is: patients' age, gender and duration of infection, but these 3 were not added to the PC analysis.
I am having some trouble understanding the following: when using the R package factoextra's function fviz_pca_biplot() to see both the sample distribution and each variable's contribution to PCs 1 and 2, I realized that all 20 of my variables have highly negative coordinates on PC1.
I generated the following images using a small sample of my original data and, even though the variable contributions are not the same, they are still highly negative for PC1. This is understandable if I do not center my data in the prcomp() function (image 1), as all of my samples are on the negative side of the PC1 component and it explains most of the data inertia.
library(factoextra)
PCAf <- read.table("PCA_small_sample.csv", sep = ";", header = T, row.names = 1)
res.pca <- prcomp(PCAf, scale = TRUE, center = F)
fviz_pca_biplot(res.pca)
However, I have been taught that it is necessary to center the data when performing PCA, and then the image becomes like this:
res.pca <- prcomp(PCAf, scale = TRUE)
fviz_pca_biplot(res.pca)
This decreases the variance explained by PC1 and increases that of PC2 but, even though it changes the variables' coordinates, there is still no positive coordinate on PC1.
res.var <- get_pca_var(res.pca)
res.var$coord
These are the values for the non centered PCA:
And for the centered PCA:
Am I doing something wrong, and should I really present my analysis with the second image even though the vectors do not match what we are seeing?
My main question is: when presenting the PCA, it is better to do so with the centered data, right? Then, should I perform some sort of correction to the variables' coordinates/contributions to the PCs? This second image does not seem too reliable to me, but that may be due to lack of experience... I mean, since all variables point toward the left side of the plot, what would be pulling some of the samples (e.g. 7, 10, 8, 4, 20) toward the right side of the plot (positive PC1)? It seems counterintuitive that there isn't even a single vector on the right side.
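For my own sanity, here is how I understand the scores to be computed (a small sketch using the centered, scaled fit from above; as far as I know, prcomp's scores are just the standardized data multiplied by the loadings):
Z <- scale(PCAf, center = TRUE, scale = TRUE)  # same preprocessing as prcomp(center = TRUE, scale = TRUE)
manual_scores <- Z %*% res.pca$rotation
all.equal(unname(manual_scores), unname(res.pca$x))  # TRUE, up to numerical tolerance
# so a sample that sits below the column means on most antibodies has mostly negative entries in Z,
# and multiplying by all-negative PC1 loadings pushes its PC1 score to the positive (right) side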
This also brings up the question: should I add confounding factors when performing a PCA? I performed linear regression to account for them but did not include them in the PC analysis.
Anyway, thank you all so much in advance.
PS: I uploaded a file containing a sample of my data, code and images on GitHub.
PS2: When plotting this with a generic dataset, I do not see the same issue. At first it happens, but after centering the data there are vectors in all four quadrants, for which I am able to extract some rationale.
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
paste("wt", 1:5, sep=""),
paste("ko", 1:5, sep=""))
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
  wt.values <- rpois(5, lambda = sample(x = 10:1000, size = 1))
  ko.values <- rpois(5, lambda = sample(x = 10:1000, size = 1))
  data.matrix[i, ] <- c(wt.values, ko.values)
}
PCAf <- t(data.matrix)
res.pca_NC <- prcomp(PCAf, scale = TRUE, center = F)
res.pca_C <- prcomp(PCAf, scale = TRUE, center = T)
fviz_pca_biplot(res.pca_NC)
fviz_pca_biplot(res.pca_C)
Not centered - generic PCA:
Centered - generic PCA:

R: Find cutoff point for continuous variable to assign observations to two groups

I have the following data
Species <- c(rep('A', 47), rep('B', 23))
Value<- c(3.8711, 3.6961, 3.9984, 3.8641, 4.0863, 4.0531, 3.9164, 3.8420, 3.7023, 3.9764, 4.0504, 4.2305,
4.1365, 4.1230, 3.9840, 3.9297, 3.9945, 4.0057, 4.2313, 3.7135, 4.3070, 3.6123, 4.0383, 3.9151,
4.0561, 4.0430, 3.9178, 4.0980, 3.8557, 4.0766, 4.3301, 3.9102, 4.2516, 4.3453, 4.3008, 4.0020,
3.9336, 3.5693, 4.0475, 3.8697, 4.1418, 4.0914, 4.2086, 4.1344, 4.2734, 3.6387, 2.4088, 3.8016,
3.7439, 3.8328, 4.0293, 3.9398, 3.9104, 3.9008, 3.7805, 3.8668, 3.9254, 3.7980, 3.7766, 3.7275,
3.8680, 3.6597, 3.7348, 3.7357, 3.9617, 3.8238, 3.8211, 3.4176, 3.7910, 4.0617)
D<-data.frame(Species,Value)
I have the two species A and B and want to find the best cutoff point for Value to determine the species.
I found the following question:
R: Determine the threshold that maximally separates two groups based on a continuous variable?
and followed the accepted answer to find the best value with the dose.p function from the MASS package. I have several similar sets of values and it worked for them, but not for the one given above (which is also the reason why I needed to include all 70 observations here).
library(MASS)
D$Species_b <- ifelse(D$Species == "A", 0, 1)
my.glm <- glm(Species_b ~ Value, data = D, family = binomial)
dose.p(my.glm, p = 0.5)
This gives me 3.633957 as the threshold:
Dose SE
p = 0.5: 3.633957 0.1755291
This results in 45 correct assignments. However, if I look at the data, it is obvious that this is not the best value. By trial and error I found that 3.8 gives me 50 correct assignments, which is clearly better.
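For reference, this is how I am counting "correct assignments" for a given cutoff (a small sketch; I treat A as the group with the larger values, as in the data above):
cutoff <- 3.8
pred <- ifelse(D$Value >= cutoff, "A", "B")
sum(pred == D$Species)                      # number of correct assignments
table(Actual = D$Species, Predicted = pred) # confusion table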
Why does the function work for the other values but not for this one? Am I making an obvious mistake? Or is there maybe a different/better approach to solving my problem? I have several values I need to do this for, so I really do not want to just test values at random until I find the best one.
Any help would be greatly appreciated.
I would typically use a receiver operating characteristic (ROC) curve for this type of analysis. It gives a visual and numerical assessment of how the sensitivity and specificity of your classification change as you adjust the threshold, so you can select the threshold at which overall accuracy is best. For example, using pROC:
library(pROC)
species_roc <- roc(D$Species, D$Value)
We can get a measure of how good a discriminator Value is for predicting Species by examining the area under the curve:
auc(species_roc)
#> Area under the curve: 0.778
plot(species_roc)
and we can find out the optimum cut-off threshold like this:
coords(species_roc, x = "best")
#> threshold specificity sensitivity
#> 1 3.96905 0.6170213 0.9130435
We see that this threshold correctly identifies 50 cases:
table(Actual = D$Species, Predicted = c("A", "B")[1 + (D$Value < 3.96905)])
#> Predicted
#> Actual A B
#> A 29 18
#> B 2 21
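Note that coords() also lets you choose how "best" is defined; a quick sketch of the two built-in criteria in pROC:
# "youden" (the default) maximizes sensitivity + specificity - 1;
# "closest.topleft" minimizes the distance to the top-left corner of the ROC curve
coords(species_roc, x = "best", best.method = "youden")
coords(species_roc, x = "best", best.method = "closest.topleft")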

Clustering leads to very concentrated clusters

To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8
Pre-processing: I had a list of orders and 696 unique item numbers, and wanted to cluster the items based on how frequently each pair of items is ordered together. For each pair of items I counted the number of times they occur within the same order; the highest count was 489 for one pair. I then "calculated" the similarity/correlation as: frequency / "maximum frequency over all pairs" (489). That gives the dataset I have uploaded.
Similarity/correlation: I don't know if my similarity approach is the best in this case. I also tried Jaccard's coefficient/index, but got almost the same results.
The dataset: The dataset contains material numbers V1 and V2, and N is the similarity between the two material numbers, between 0 and 1.
With help from someone else, I managed to create a distance matrix and use PAM clustering.
Why PAM clustering? A data scientist suggested it: "You have more than 95% of pairs without information; this puts all these materials at the same distance and produces a single, very dispersed cluster. This problem can be solved using a PAM algorithm, but you will still have a very concentrated group. Another solution is to increase the weight of the distances other than one."
Problem 1: The matrix is only 567x567. I think for clustering I need the full 696x696 matrix, even though a lot of the entries are zeros. But I'm not sure.
Problem 2: The clustering does not do very well. I get very concentrated clusters: a lot of items end up in the first cluster. Also, according to how you verify PAM clusters, my clustering results are poor. Is it due to the similarity analysis? What else should I use? Is it because 95% of the data are zeros? Should I change the zeros to something else?
The whole code and results:
# Suppose X is the dataset
library(data.table)
library(cluster)
df <- data.table(X)
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1 ~ V2, value.var = "N")[, -1]
ss <- ss / max(ss, na.rm = TRUE)
ss[is.na(ss)] <- 0
diag(ss) <- 1
Now using the PAM clustering
dd2 <- as.dist(1 - sqrt(ss))
pam2 <- pam(dd2, 4)
summary(as.factor(pam2$clustering))
But I get very concentrated clusters, as:
1 2 3 4
382 100 23 62
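(For context, this is how I have been judging the cluster quality; a minimal sketch based on the silhouette information that pam() returns:)
# overall average silhouette width: close to 1 = well separated, close to 0 = weak structure
pam2$silinfo$avg.width
# average silhouette width per cluster
pam2$silinfo$clus.avg.widths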
I'm not sure where you get the 696 number from. After you rbind, you have a data frame with 567 unique values for V1 and V2; you then perform the dcast and end up with a 567 x 567 matrix, as expected. Clustering-wise, I see no issue with your clusters.
dim(df) # [1] 7659 3
test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318 3
length(unique(test$V1)) # 567
length(unique(test$V2)) # 567
test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
@Mayo, forget what the data scientist said about PAM. Since you've mentioned this work is for a thesis, then from an academic viewpoint your current justification for why PAM is required does not hold much merit. Essentially, you need to either prove or justify why PAM is a necessity for your case study. And given the nature of the (continuous) variables in the dataset, V1, V2, N, I do not see the logic of applying PAM here (as I mentioned in the comments, PAM works best for mixed variables).
Continuing further, see this post on correlation detection in R:
# Objective: Detect Highly Correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
# compute correlation matrix using the cor()
res<- cor(my_data)
round(res, 2) # Unfortunately, the function cor() returns only the correlation coefficients between variables.
# Visualize the correlation matrix
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red.
# Color intensity and circle size are proportional to the correlation coefficients.
# The legend on the right side of the correlogram shows the correlation coefficients
# and the corresponding colors.
# tl.col (text label color) and tl.srt (text label string rotation) are used to
# change text colors and rotations.
# Apply a correlation filter at 0.80
# install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
[1] "disp" "mpg"
# returns indices of highly correlated variables
removeHighCor <- findCorrelation(res, cutoff = 0.80)
# remove highly correlated variables from the dataset
my_data <- my_data[, -removeHighCor]
dim(my_data)
[1] 32 4
Hope this helps.

Need help applying regression model to dataset in R (sports data)

Update: Solved!
I'm currently trying to create a regression model for football that predicts a team's total points based on its pass yards and rush yards. I got all the way to the regression equation, but from here I do not know how to "plug in" the formula.
The data table is essentially all 32 NFL teams listed in rows, with their offensive stats in columns.
Code:
# 1. Import
Offense <- read.csv(file.choose(), header = TRUE)
# 2. View
show(Offense)
# 3. Attach so headers can be referenced
attach(Offense)
# 4. Create regression model
mod1 <- lm(Total.Points ~ Pass.Yds + Rush.Yds)
summary(mod1)
# Formula obtained from summary: -255.60178 + 0.10565(Pass) + 0.12154(Rush)
# Plug in the regression equation
predict(mod1)
Output: https://imgur.com/a/AbTNF
I see that at the end it applied the regression equation to all 32 rows, but how do I
get it to display as a ranked list, and
get it to display, say, the team name as well as the projected score (so I don't have to wonder which team "1" or "2" refers to)?
Since I have the equation, could I also just write a loop that runs the equation for every row of data I have and prints the results?
I'm a beginner so much appreciated!
Update: Came up with this
####Part 2. Interpretation
#1. Examining quality of model
summary(mod1)
cor(Pass.Yds, Rush.Yds)
#2. Formula obtained from summary: -255.60178 + .10565(Pass) + .12154(Rush)
#3. Predicted Points (Descending Order)
proj <- sort(predict(mod1), decreasing = TRUE)
proj
#4. Corresponding Name (Descending)
name <- Team[order(predict(mod1), decreasing = TRUE)]
name
#Data Frame
Projections <- data.frame(name, proj)
Projections
Meanwhile, bbrot provided a much simpler version:
Assuming that Teams is the vector of team names, something like cbind(Teams[order(predict(mod1), decreasing = TRUE)], sort(predict(mod1), decreasing = TRUE)) should do...
Edit: Your Teams vector seems to be a factor. In this case, the following commands are going to work:
# returns a character matrix
cbind(as.character(Teams)[order(predict(mod1), decreasing = TRUE)],
sort(predict(mod1), decreasing = TRUE))
# returns a data frame
data.frame(Teams = Teams[order(predict(mod1), decreasing = TRUE)],
Points = sort(predict(mod1), decreasing = TRUE))
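Regarding the question about writing a loop: there is no need for one, because R arithmetic is vectorized. A minimal sketch using the coefficients from the summary above (predict(mod1) gives the same values up to rounding):
# manual "plug in" of the fitted equation, one predicted score per team
manual_proj <- -255.60178 + 0.10565 * Offense$Pass.Yds + 0.12154 * Offense$Rush.Yds
# or, without copying the numbers by hand:
cf <- coef(mod1)
manual_proj <- cf["(Intercept)"] + cf["Pass.Yds"] * Offense$Pass.Yds + cf["Rush.Yds"] * Offense$Rush.Yds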

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit).
However, I'm now trying to analyze the effects of covariates in my model. I find the help file for mlogit.effects() not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see below, index(mod1)).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use with multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Responsive and Officials). Would I just model whatever they picked as their first rank and ignore what they chose in the other ranks? I can get that information.
Reproducible code is below:
# Load packages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
gather(Open, Rank, -c(1,9:12)) %>%
arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare with the example from the help file:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample (this is the example given in the help file for effects)
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code models the log-odds of ranking an item 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item $\text{Item}_K$, then we're modeling

$$\log\!\left[\frac{\Pr(\text{Item}_k \text{ is 1st})}{\Pr(\text{Item}_K \text{ is 1st})}\right] = \alpha_k + x^{\top}\beta_k$$

for $k = 1, \dots, K-1$, where $\text{Item}_k$ is one of the other (i.e. non-reference) items. The choice of reference level affects the coefficients and their interpretation, but not the predicted probabilities. (The same goes for the reference levels of the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
dat3$gender <- relevel(dat3$gender, ref="Female")
dat3$Job <- relevel(dat3$Job, ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
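If you want rough Wald tests for the coefficients, multinom() does not report p-values itself, but they can be backed out from the summary; a sketch:
s <- summary(fit1)
z <- s$coefficients / s$standard.errors    # Wald z statistics
p <- 2 * pnorm(abs(z), lower.tail = FALSE) # two-sided p-values
round(p, 3)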
You are working with ranked data, not just multinomial choice data. The structure of ranked data in mlogit is that the first set of records for a person contains all options, the second contains all options except the one ranked first, and so on. But the index assumes an equal number of options each time, hence a bunch of NAs. We just need to get rid of them.
with(mlogit.out, data.frame(economic = tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91
