Correlation matrix of a bunch of categorical variables in R

I have about 20 variables about different cities that are labeled "Y" or "N" and stored as factors. The variables are things like "has co-op" and so on. I want to find some correlations and possibly use the corrplot package to display the connections between all these variables. But for some reason I cannot coerce the variables so that they are read in a way that corrplot or even cor() likes, so that I can get them into a matrix. I tried:
M <- cor(model.matrix(~.-1,data=mydata[c(25:44)]))
but the results in corrplot came out really weird. Does anyone have a fast way to turn a bunch of Y/N answers into a correlation matrix? Thanks!

You can use the sjp.corr function (graphical output) or the sjt.corr function (tabular output), both from the sjPlot package.
DF <- data.frame(v1 = sample(c("Y","N"), 100, TRUE),
                 v2 = sample(c("Y","N"), 100, TRUE),
                 v3 = sample(c("Y","N"), 100, TRUE),
                 v4 = sample(c("Y","N"), 100, TRUE),
                 v5 = sample(c("Y","N"), 100, TRUE))
# wrap in factor() before as.integer(): since R 4.0, data.frame() keeps
# strings as character, and as.integer() on "Y"/"N" would return NA
DF[] <- lapply(DF, function(x) as.integer(factor(x)))
library(sjPlot)
sjp.corr(DF)
sjt.corr(DF)
[Plot and table output omitted: sjp.corr() draws the correlation plot; sjt.corr() renders the table in the RStudio viewer pane.]
You can use many parameters to modify the appearance of the plot or table, see some examples here.

For binary variables, you might consider cross tabs (the table function in R).
However, getting the correlation matrix is pretty straightforward:
# example data
set.seed(1)
DF <- data.frame(x = sample(c("Y","N"), 100, TRUE),
                 y = sample(c("Y","N"), 100, TRUE))
# get the correlation; factor() is needed because since R 4.0 data.frame()
# keeps strings as character, and as.integer() on them would return NA
DF[] <- lapply(DF, function(x) as.integer(factor(x)))
cor(DF)
#            x          y
# x  1.0000000 -0.0369479
# y -0.0369479  1.0000000
# visualize it
library(corrplot)
corrplot(cor(DF))
When you convert to integer in this example, "N" becomes 1 and "Y" becomes 2. That mapping holds generally for default factors, because factor() sorts its levels alphabetically ("N" before "Y"). To check the mapping for your data, run lapply(DF, levels) before converting to integer.
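For example, here is a quick check of the level-to-integer mapping on a toy column (the data frame name toy is just for illustration):
toy <- data.frame(x = factor(c("Y", "N", "Y")))
lapply(toy, levels)   # $x: "N" "Y"  -> N maps to 1, Y maps to 2
as.integer(toy$x)     # 2 1 2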
To me, the plot makes sense. If you have questions about the statistical interpretation of correlations in this context, you should consider having a look at http://stats.stackexchange.com

Related

Distribution of mean * standard deviation of a sample from a Gaussian

I'm trying to assess the feasibility of an instrumental variable in my project, using a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what this distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a population of 1000 individuals with a variable x following a Gaussian distribution. Draw 50 random samples of 5 individuals from this population with replacement. Calculate the mean and standard deviation of x for each sample. Create an interaction variable y by multiplying the mean and standard deviation of x for each sample. Plot the distribution of y.
Beginner's version
There might be more efficient ways to code this, but this one is easy to follow, I think:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N <- 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for (i in 1:N) {
  # draw 5 samples from the population (without replacement);
  # I assume you want to replace between each turn of taking 5.
  # If you want to replace between drawing each of the 5,
  # set replace = TRUE below.
  smpl <- sample(stat_pop, size = 5, replace = FALSE)
  # the data.frame has two columns; in row i we store the mean and sd
  samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ extracts a column of the data.frame by name.
# Here, we create a new column y from the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, explicit loops are often slower than vectorized alternatives, so they are usually avoided where possible. For your question it does not make any noticeable difference, but for large data sets performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N <- 50
n <- 5
# again, replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# transpose the output and turn it into a data.frame to make it
# more convenient to work with (one row per sample)
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, 1:n], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
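If all you need is y, a still more compact variant (same assumptions as above) computes it directly inside replicate():
y <- replicate(N, {
  smpl <- sample(stat_pop, n, replace = FALSE)
  mean(smpl) * sd(smpl)
})
hist(y)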
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" in order to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

Is there a way in R to do a pairwise-weighted correlation matrix?

I have a survey with a lot of numeric variables (both continuous and dummy-binary) and more than 800 observations. Of course, there is missing data for most of the variables (at different rates). I need to use a weighted correlation table because some samples represent more of the population than others. I also want to minimize the number of unused samples, and in this way keep the maximum number of observations for each pair of variables. I know how to do a pairwise correlation matrix (e.g., cor(data, use="pairwise.complete.obs")) and how to do a weighted correlation matrix (e.g., cov.wt(data %>% select(-weight), wt=data$weight, cor=TRUE)). However, I couldn't find a way (yet) to use both together. Is there a way to do a pairwise-weighted correlation matrix in R? I would much appreciate any help or recommendations.
Good question. Here is how I do it. It is not fast, but it is faster than looping.
df_correlation is a data frame containing only the variables I want the correlations for, and newdf is my original data frame with the weight column (here called BalancingWeights) and the other variables.
library(purrr)   # for map() and map_dbl()
library(wCorr)
data_list <- combn(names(df_correlation), 2, simplify = FALSE)
data_list <- map(data_list, ~c(., "BalancingWeights"))
dimension <- length(names(df_correlation))
allcorr <- matrix(data = NA, nrow = dimension, ncol = dimension)
row.names(allcorr) <- names(df_correlation)
colnames(allcorr) <- names(df_correlation)
myfunction <- function(data, x, y, weight) {
  indice <- !(is.na(data[[x]]) | is.na(data[[y]]))
  wCorr::weightedCorr(data[[x]][indice],
                      data[[y]][indice], method = c("Pearson"),
                      weights = data[[weight]][indice], ML = FALSE, fast = TRUE)
}
b <- map_dbl(data_list, ~myfunction(newdf, .[1], .[2], .[3]))
# combn() enumerates the pairs in the same (column-major) order in which
# lower.tri() indexes a matrix, so fill the lower triangle first and then
# mirror it; assigning b to both triangles directly would scramble the
# entries for more than three variables
allcorr[lower.tri(allcorr)] <- b
allcorr[upper.tri(allcorr)] <- t(allcorr)[upper.tri(allcorr)]
diag(allcorr) <- 1
View(allcorr)
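To test it end to end, a small simulated setup can stand in for the real survey (the variable names a, b, c are illustrative; the weight column name matches the one assumed above):
set.seed(1)
newdf <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200),
                    BalancingWeights = runif(200, 0.5, 1.5))
newdf$a[sample(200, 20)] <- NA          # some missingness, as in a real survey
df_correlation <- newdf[, c("a", "b", "c")]
# running the snippet above now fills allcorr with a symmetric 3 x 3
# pairwise-weighted correlation matrix (diagonal 1, NAs dropped per pair)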

How to plot weighted survey data and export weighted data from R to SPSS?

I have survey data with a weighting variable called weight that I want to use on my dataset. The survey package unfortunately doesn't work the way I expected. What I ultimately need is to
plot a (bar) chart (with the percentage distribution on the y-axis) that takes the weighting into account but keeps the levels of the factor on the x-axis, and
somehow export the weighted dataset to SPSS.
Is there a way this could work? A solution with the survey package would also be fine, as I have no objections to it; I just couldn't get it to work.
I know there are some posts on weighting issues, but I couldn't find a fitting solution to my problem. Thanks for your help.
# create data
surveydata <- as.data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
# change values of columns
surveydata$V1 <- (replicate(1,sample(c(0.5,1,1.5),1000,rep=TRUE)))
surveydata$V2 <- as.factor(sample(3, size = nrow(surveydata), replace = TRUE))
levels(surveydata$V2)[levels(surveydata$V2)=="1"] <- "a"
levels(surveydata$V2)[levels(surveydata$V2)=="2"] <- "b"
levels(surveydata$V2)[levels(surveydata$V2)=="3"] <- "c"
# rename columns
colnames(surveydata)[1] <- "weight"
colnames(surveydata)[2] <- "variable"
With proportions rather than percentages
> library(survey)
> des<-svydesign(id=~1, weights=~weight,data=surveydata)
> barplot(svymean(~variable,des))
With percentages, the easiest way is probably to use svytable(), which has an argument for scaling the totals; the code below shows totals, proportions, and percentages:
> svytable(~variable,des)
variable
    a     b     c 
320.5 365.5 331.5 
> svytable(~variable,des,Ntotal=1)
variable
        a         b         c 
0.3149877 0.3592138 0.3257985 
> svytable(~variable,des,Ntotal=100)
variable
       a        b        c 
31.49877 35.92138 32.57985 
So
barplot(svytable(~variable,des,Ntotal=100),col="orange",ylab="%")
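If you prefer ggplot2 over base graphics, its weight aesthetic gives a similar weighted percentage bar chart. This is a sketch, not part of the original answer (after_stat() needs ggplot2 >= 3.3):
library(ggplot2)
ggplot(surveydata, aes(x = variable, weight = weight)) +
  geom_bar(aes(y = after_stat(count / sum(count) * 100))) +  # % of total weight
  ylab("%")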
To transfer the data to SPSS, I would use write.foreign(), which produces a plain-text data file and an SPSS code file to read it in.
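A minimal sketch of that export (the file names are placeholders):
library(foreign)
write.foreign(surveydata,
              datafile = "surveydata.txt",   # plain-text data
              codefile = "surveydata.sps",   # SPSS syntax file that reads it
              package = "SPSS")
In SPSS, the weighting can then be applied with WEIGHT BY weight.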

Covariance with collinear vectors

I'm trying to calculate the covariance of a matrix that has two collinear vectors. I have read that it is impossible with the cov function in R.
Does a different function exist in R to calculate the covariance of a matrix that has two collinear columns (it works in Matlab and Excel)?
Thank you in advance for your answers.
Please consider providing a reproducible example with a sample of your data and the corresponding code. Broadly speaking, a covariance matrix can be created with the code below:
# Vectors (all of length 4)
V1 <- 1:4
V2 <- 5:8
V3 <- runif(n = 4)
V4 <- runif(n = 4)
# create matrix
M <- cbind(V1, V2, V3, V4)
# Covariance
cov(M)
I'm guessing that you may be getting the following error:
number of rows of result is not a multiple of vector length (arg 1)
That message comes from binding vectors of unequal length, so check that all your columns have the same number of rows. You could first try to use the cov function as discussed here.
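Note, for what it's worth, that cov() itself accepts exactly collinear columns; problems only appear later when the singular matrix is, for example, inverted. A quick check (not from the original answer):
x <- 1:10
M2 <- cbind(x, y = 2 * x)   # second column is exactly twice the first
cov(M2)        # works: cov() has no problem with collinear columns
det(cov(M2))   # ~0, i.e., the covariance matrix is singular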

R: Finding solutions for new x values with nlmrt

Good day,
I have tried to figure this out, but I really can't! Here is an example of my data in R:
x <- c(36,71,106,142,175,210,246,288,357)
y <- c(19.6,20.9,19.8,21.2,17.6,23.6,20.4,18.9,17.2)
table <- data.frame(x,y)
library(nlmrt)
curve <- "y~ a + b*exp(-0.01*x) + (c*x)"
ones <- list(a=1, b=1, c=1)
Then I use wrapnls to fit the curve and to find a solution:
solve <- wrapnls(curve, data=table, start=ones, trace=FALSE)
This is all fine and works for me. Then, using the following, I obtain a prediction of y for each of the x values:
predict(solve)
But how do I find the prediction of y for new x values? For instance:
new_x <- c(10, 30, 50, 70)
I have tried:
predict(solve, new_x)
predict(solve, 10)
It just gives the same output as:
predict(solve)
I really hope someone can help! I know that if I took the values of 'solve' for parameters a, b, and c and substituted them into the curve formula with the desired x value, I would be able to do this, but I'm wondering if there is a simpler option, and one that doesn't require plotting the data first.
predict() requires the new data to be a data.frame with column names that match the variable names used in your model (whether your model has one or many variables). All you need to do is use
predict(solve, newdata = data.frame(x = new_x))
# [1] 18.30066 19.21600 19.88409 20.34973
And that will give you a prediction for just those 4 values. It's somewhat unfortunate that any mistake in specifying the new data results in the fitted values for the original model being returned; an error message would probably have been more useful, but oh well.
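For completeness, the manual route mentioned in the question looks like this (assuming, as the working predict() call above suggests, that wrapnls returns a standard nls object, so coef() applies):
p <- coef(solve)
p[["a"]] + p[["b"]] * exp(-0.01 * new_x) + p[["c"]] * new_x
# same four predictions as predict(solve, newdata = data.frame(x = new_x))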
