Using R, how can I create an index using principal components? - r

I have already done a PCA and obtained three principal components, but I don't know how to transform these into an index.
I know, for example, that in Stata there is a command, "predict index, score", but I cannot find a way to do this in R.
What I want to do is create a socioeconomic index from variables such as level of education, internet access, etc., using PCA.
Thank you!

In general, I use the PCA scores as an index. See an example below:
# Load the psych package, you could also use princomp in the stats package
library(psych)
# Example data
df <- data.frame(x1 = rnorm(100, 0, .5)
, x2 = rnorm(100, 0, 1)
, x3 = rnorm(100, .02, 1)
)
# run the PCA
PCA_results <- principal(df, nfactors = 1)
# add the PCA scores (the first column of the score matrix) as an index
df$index <- PCA_results$scores[, 1]
You could rescale the scores if you want them to be on a 0-1 scale.
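For example, a simple min-max rescaling (one convention among several; any monotone rescaling preserves the ranking of observations):
# Min-max rescaling of the PCA-based index to the 0-1 range
df$index01 <- (df$index - min(df$index)) / (max(df$index) - min(df$index))
summary(df$index01)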

Related

Simulating data with a correlation between a count variable and a continuous variable

Does anybody know how I could possibly simulate data with a correlation between a count variable and a continuous variable? Right now the best idea I have is to just transform the count variable to make it approximately normal, and then to simulate the data using this R code:
set.seed(2018)
x = rnorm(n = 1000, mean = 0, sd = 1)
y = rnorm(n = 1000, mean = .29*x, sqrt(1-.3^2))
cor(x,y)
However, I really think it would be preferable if I could actually make Y a count variable (because counts are typically right-skewed). Also, I want to be able to specify specific correlations between x and y, e.g., simulate data with a 0.5 correlation between x and y.
Edit: I'm still looking for help!
You can use runif to simulate the continuous variable, then feed the result as the lambda (rate) parameter of rpois:
set.seed(1)
continuous <- runif(100, 0, 10)
counts <- rpois(100, continuous)
plot(continuous, counts)
cor(counts, continuous)
#> [1] 0.7852701
Created on 2020-12-11 by the reprex package (v0.3.0)
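One caveat about controlling the strength: the correlation here falls out of how spread-out the lambda values are relative to the Poisson noise, so you tune it indirectly by widening or narrowing the range of the continuous variable rather than setting it exactly. A rough illustration (the 0-2 and 0-50 ranges are arbitrary choices):
set.seed(1)
# Narrow range of lambda -> Poisson noise dominates -> weaker correlation
x_narrow <- runif(10000, 0, 2)
cor(x_narrow, rpois(10000, x_narrow))
# Wide range of lambda -> the signal dominates -> stronger correlation
x_wide <- runif(10000, 0, 50)
cor(x_wide, rpois(10000, x_wide))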

Clustering using categorical and continuous data together

I am trying to create an unsupervised model with categorical and continuous data together. I think I have worked it out, but is this the correct way to do this?
Load Libraries
library(tidyr)
library(dummies)
library(fastDummies)
library(cluster)
library(dplyr)
create sample data set
set.seed(3)
sampleData <- data.frame(
  id = 1:50,
  gender = sample(c("Male", "Female"), 10, replace = TRUE),
  age_bracket = sample(c("0-10", "11-30", "31-60", ">60"), 10, replace = TRUE),
  income = rnorm(10, 40, 10),
  volume = rnorm(50, 40, 100)
)
Create sparse matrix and scale
sd1 <- sampleData %>%
dummy_cols(select_columns = c("gender","age_bracket"))%>%
mutate(id = factor(id))%>%
select(-c(gender,age_bracket))%>%
mutate_if(is.numeric, scale)
glimpse(sd1)
Generate a k-means model using the pam() function with a k = 3
sd2 <- pam(sd1, k =3)
Extract the vector of cluster assignments from the model
sd3 <- sd2$cluster
Build the segment_customers dataframe
sd4 <- mutate(sd1, cluster = sd3)
Calculate the size of each cluster
count(sd4, cluster)
Dummy coding of variables is fairly standard, but I am not a fan of it. In many cases this IMHO causes large bias, and hinders interpretability.
In your case, you may additionally be applying standardization to them, which makes variable bias even worse.
Your text claims to use k-means, but uses PAM. These are not the same. PAM is IMHO a better choice here, because of interpretability, and the ability to use other metrics such as Manhattan. The resulting cluster "centers" are data points, not means.
I recommend going down to the mathematical level. PAM tries to minimize the sum of distances to the centers. Now put in the distance you use, e.g., Manhattan. Now substitute the standardization and dummy encoding in there, and you get the actual problem your approach tries to solve. Now have a critical look at this (probably quite large) term: is that helpful for your problem, or are you optimizing the wrong function?
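As a concrete alternative (my sketch, not code from the answer above): PAM can also be run on a Gower dissimilarity from cluster::daisy, which handles the categorical variables directly and avoids both the dummy coding and the per-variable standardization.
library(cluster)
library(dplyr)
# Keep the categorical columns as factors and drop the identifier
mixed <- sampleData %>%
  select(-id) %>%
  mutate(gender = as.factor(gender), age_bracket = as.factor(age_bracket))
# Gower dissimilarity copes with mixed numeric/categorical data
gower_dist <- daisy(mixed, metric = "gower")
pam_fit <- pam(gower_dist, k = 3, diss = TRUE)
# Cluster sizes, and the medoids (actual observations, which helps interpretation)
table(pam_fit$clustering)
sampleData[pam_fit$medoids, ]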

Forecasting multiple variable time series in R

I am trying to forecast three variables using R, but I am running into issues on how to deal with correlation.
The three variables I am trying to forecast are Revenue, Subscriptions and Price.
My initial approach was to do two independent time series forecasts of subscriptions and price and multiply the outcomes to generate the revenue forecast.
I wanted to understand if this approach makes sense, as there is an inherent correlation between the price and the subscribers, and this is the part I do not know how to deal with.
# Load packages.
library(forecast)
# Read data
data <- read.csv("data.csv")
data.train <- data[0:57,]
data.test <- data[58:72,]
# Create time series for variables of interest
data.subs <- ts(data.train$subs, start=c(2014,1), frequency = 12)
data.price <- ts(data.train$price, start=c(2014,1), frequency = 12)
#Create model
subs.stlm <- stlm(data.subs)
price.stlm <- stlm(data.price)
#Forecast
subs.pred <- forecast(subs.stlm, h = 15, level = c(0.6, 0.75, 0.9))
price.pred <- forecast(price.stlm, h = 15, level = c(0.6, 0.75, 0.9))
Any help is greatly appreciated!
Looks like you can use the vector autoregression (VAR) model. Take a look at the description and the code provided here:
https://otexts.org/fpp2/VAR.html
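A minimal sketch of that idea with the vars package (the lag order p = 2 is purely illustrative; VARselect is there to help choose it):
library(vars)
# Joint model for subscriptions and price on the training window
train <- cbind(subs = data.train$subs, price = data.train$price)
VARselect(train, lag.max = 12, type = "const")   # compare information criteria
# season = 12 adds monthly seasonal dummies, replacing what stlm handled before
var_fit <- VAR(train, p = 2, type = "const", season = 12)
var_pred <- predict(var_fit, n.ahead = 15, ci = 0.9)
# Implied revenue point forecast (note: multiplying point forecasts ignores
# the forecast covariance between the two series)
rev_pred <- var_pred$fcst$subs[, "fcst"] * var_pred$fcst$price[, "fcst"]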

Constructing components from PLSR loadings in R

I want to compute the components for a set of variables using the loadings (weights) from a PLSR fitted with the plsr function.
I thought that the components were computed by summing the values of each variable multiplied by the estimated loading (weight).
However, using the output from plsr and doing that doesn't give me the expected values:
Example:
library("pls")
data(oliveoil)
sens.pcr <- plsr(sensory ~ chemical, ncomp = 4, scale = F, data = oliveoil)
Extract loadings/weights:
df <- cbind(sens.pcr$loadings[,1],sens.pcr$loadings[,2],sens.pcr$loadings[,3],sens.pcr$loadings[,4])
One test observation:
firstrow <- oliveoil$chemical[1,]
Extract the components (scores):
scores <- sens.pcr$scores
Do the linear combination:
sum(firstrow*df[,1])
[1] -12.81924
Which is not equal to the first score scores[1,1] = 0.5100166
What is it that I am missing?
Using sens.pcr$loading.weights didn't make any big difference either.
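In case it helps anyone hitting the same discrepancy, my understanding (hedged) is that two things are missing from the manual computation: plsr centers X by default, and the full score matrix is obtained from the projection matrix stored in the fitted object rather than from the loadings. A sketch using the Xmeans and projection components of the mvr object:
# Center the predictors with the means stored in the fitted model
X0 <- sweep(oliveoil$chemical, 2, sens.pcr$Xmeans)
# Scores should be the centered X times the projection matrix
manual_scores <- X0 %*% sens.pcr$projection
manual_scores[1, 1]   # compare with scores[1, 1]
all.equal(unclass(sens.pcr$scores), manual_scores, check.attributes = FALSE)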

How to create a random loss sample in R using an if function

I am working currently on generating some random data for a school project.
I have created a variable in R using a binomial distribution to determine whether an observation had a loss (yes = 1) or not (= 0).
Afterwards I am trying to generate the loss amount using a random distribution for all observations which already had a loss (=1).
As my loss amount is a percentage, it can be anywhere between 0 and 1, which is why I am looking at a beta distribution (see "What Is The Intuition Behind Beta Distribution" on stats.stackexchange).
In a third step I am looking for an if statement, which combines my two variables.
Please find below my code (which is only working for the Loss_Y_N variable):
Loss_Y_N = rbinom(1000000,1,0.01)
Loss_Amount = dbeta(x, 10, 990, ncp = 0, log = FALSE)
ideally I can combine the two into something like
if(Loss_Y_N=1 then Loss_Amount=dbeta(...) #... is meant to be a random variable with mean=0.15 and should be 0<x=<1
else Loss_Amount=0)
Any input highly appreciated!
Create a vector for your loss proportion. Fill up the elements corresponding to losses with draws from the beta. Tweak the parameters for the beta until you get the desired result.
N <- 100000
loss_indicator <- rbinom(N, 1, 0.1)
loss_prop <- numeric(N)
loss_prop[loss_indicator > 0] <- rbeta(sum(loss_indicator), 10, 990)
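On the "tweak the parameters" point: the mean of a Beta(a, b) distribution is a / (a + b), so for the target mean of 0.15 one illustrative choice is a = 1.5, b = 8.5; scaling both parameters up by the same factor keeps the mean but tightens the spread around it.
a <- 1.5; b <- 8.5
a / (a + b)                 # 0.15, the target mean
draws <- rbeta(1e5, a, b)
mean(draws)                 # close to 0.15
range(draws)                # strictly inside (0, 1)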
