Covariance matrices by group, lots of NA - r

This is a follow-up question to my earlier post (covariance matrix by group) regarding a large data set. I have 6 variables (HML, RML, FML, TML, HFD, and BIB) and I am trying to create group-specific covariance matrices for them (based on the variable Group). However, I have a lot of missing data in these 6 variables (not in Group) and I need to be able to use that data in the analysis - removing or omitting by row is not a good option for this research.
I narrowed the data set down into a matrix of the actual variables of interest with:
MMatrix = MMatrix2[1:2187, 4:10]
This worked fine for calculating an overall covariance matrix with:
cov(MMatrix, use = "pairwise.complete.obs", method = "pearson")
So to get this to list the covariance matrices by group, I turned the original data matrix into a data frame (so I could use the $ indicator) with:
CovDataM <- as.data.frame(MMatrix)
I then used the following suggested code to get covariances by group, but it keeps returning NULL:
cov.list <- lapply(unique(CovDataM$group), function(x) cov(CovDataM[CovDataM$group == x, -1]))
I figured this was because of my NAs, so I tried adding use = "pairwise.complete.obs" as well as use = "na.or.complete" (when desperate) to the end of the code, but it still only returned NULL. I read somewhere that "pairwise.complete.obs" can only be used with method = "pearson", but adding that at the end didn't make a difference either. I need to get covariance matrices of these variables by group, with all the available data included if possible, and I am thoroughly stuck.

Here is an example that should get you going:
# Create some fake data
m <- matrix(runif(6000), ncol=6,
            dimnames=list(NULL, c('HML', 'RML', 'FML', 'TML', 'HFD', 'BIB')))
# Insert random NAs
m[sample(6000, 500)] <- NA
# Create a factor indicating group levels
grp <- gl(4, 250, labels=paste('group', 1:4))
# Covariance matrices by group
covmats <- by(m, grp, cov, use='pairwise')
The resulting object, covmats, is a list with four elements (in this case), which correspond to the covariance matrices for each of the four groups.
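For example, you can pull out a single group's matrix by name or by position (the names come from the labels given to gl above):
# Access one group's covariance matrix
covmats[['group 1']]
covmats[[1]]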

Your problem is that lapply is treating your list oddly. If you run this code (which I hope is pretty much analogous to yours):
CovData <- matrix(1:75, 15)
CovData[3,4] <- NA
CovData[1,3] <- NA
CovData[4,2] <- NA
CovDataM <- data.frame(CovData, "group" = c(rep("a",5),rep("b",5),rep("c",5)))
colnames(CovDataM) <- c("a","b","c","d","e", "group")
lapply(unique(as.character(CovDataM$group)), function(x) print(x))
You can see that lapply is evaluating the list in a different manner than you intend. The NAs don't appear to be the problem. When I run:
by(CovDataM[ ,1:5], CovDataM$group, cov, use = "pairwise.complete.obs", method = "pearson")
It seems to work fine. Hopefully that generalizes to your problem.
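If you want to stay closer to your original lapply approach, here is a sketch of the same computation that splits the numeric columns by group first, so the result is a list keyed by group name (this assumes the grouping column is called group, as above):
# Equivalent lapply version: split the numeric columns by group
cov.list <- lapply(split(CovDataM[, 1:5], CovDataM$group),
                   cov, use = "pairwise.complete.obs", method = "pearson")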

Related

Correlation matrix produces 1s in diagonal and NA for the rest

I have a data frame (Compiled_data) with 7 columns of numeric data. I wanted to find the correlation between the different columns using the cor() function. The code returns a correlation matrix with 1s on the diagonal while all the remaining positions are NA.
Column_headers <- c("Country", "Country_code", "Year", "Death.rate",
"Fertility.rate", "Greenhouse.gas", "Mobile.subs",
"Permanent_cropland","Population.density",
"Birth.rate")
I want to explore the interaction between the data in columns "Death.rate" to "Birth.rate":
Death.rate <- c(19.262,19.321,19.120,18.652)
Fertility.rate <- c(6.942,6.928,6.904,6.869)
Greenhouse.gas <- c(107540.6,109807.3,111165.3,110459.4)
Mobile.subs <- c(NA,4,0,0)
Permanent.cropland <- c(1.982024,1.982024,1.982024,1.982024)
Population.density <- c(503.4312,511.8361,519.6092,528.0958)
Birth.rate <- c(46.879,46.511,46.117,45.704)
I would also like to exclude NAs and 0s from being considered in the calculation. Any help would be great!
Like Ronak mentioned, you probably have nulls in the data which are interfering with the computation of the correlation. You will need to supply something for the "use" argument of your correlation function, e.g. "pairwise.complete.obs", to compare only observations where both variables have data. If you want to remove 0s as well, you might want to coerce them to NAs before running the correlation function.
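A minimal sketch of that suggestion, assuming the numeric columns of Compiled_data are columns 4 through 10 as in the code below:
# Treat zeros as missing, then use pairwise-complete observations
cordata <- Compiled_data[, 4:10]
cordata[cordata == 0] <- NA
cor(cordata, use = "pairwise.complete.obs")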
Thanks everyone for the feedback. The following code worked for this:
cordata <- Compiled_data[, 4:10]
corr <- cor(cordata, use = "pairwise", method = "spearman")

How do you remerge the response variable to the data frame after removing it for standardization?

I have a dataset with 61 columns (60 explanatory variables and 1 response variable).
All the explanatory variables are numerical, and the response is categorical (Default). Some of the explanatory variables have negative values (financial data), so it seems more sensible to standardize rather than normalize. However, when standardizing using the "apply" function, I have to remove the response variable first, so I do:
model <- read.table......
modelwithnoresponse <- model
modelwithnoresponse$Default <- NULL
means <- apply(modelwithnoresponse, 2, mean)
standarddeviations <- apply(modelwithnoresponse, 2, sd)
modelSTAN <- scale(modelwithnoresponse, center = means, scale = standarddeviations)
So far so good, the data is standardized. However, now I would like to add the response variable back to "modelSTAN". I've seen some posts on dplyr, merge functions, and rbind, but I couldn't quite get them to work so that the response would simply be added back as the last column of "modelSTAN".
Does anyone have a good solution to this, or maybe another workaround to standardize it without removing the response variable first?
I'm quite new to R, as I'm a finance student who took R as an elective.
If you want to add the column model$Default to the modelSTAN data frame, you can do it like this
# assign the column directly
modelSTAN$Default <- model$Default
# or use cbind for columns (rbind is for rows)
modelSTAN <- cbind(modelSTAN, model$Default)
However, you don't need to remove it at all. Here's an alternative:
modelSTAN <- model
## get index of the response, named Default here
resp <- which(names(modelSTAN) == "Default")
## standardize all the non-response columns
means <- colMeans(modelSTAN[-resp])
sds <- apply(modelSTAN[-resp], 2, sd)
modelSTAN[-resp] <- scale(modelSTAN[-resp], center = means, scale = sds)
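A quick sanity check after standardizing: every non-response column should now have mean roughly 0 and sd 1.
# Verify the standardization
round(colMeans(modelSTAN[-resp]), 10)
apply(modelSTAN[-resp], 2, sd)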
If you're interested in dplyr:
library(dplyr)
modelSTAN <- model %>%
  mutate(across(-all_of("Default"), ~ as.numeric(scale(.x))))
Note, in the dplyr version I didn't bother saving the original means and SDs; you should still do that if you want to back-transform later. By default, scale() centers on the column mean and scales by the column sd (the as.numeric() wrapper just keeps scale() from turning each column into a one-column matrix).
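If you do save them, undoing the standardization is just a multiply and an add. A sketch using the means and sds computed in the base-R version above:
# Back-transform: multiply each column by its sd, then add its mean
orig <- sweep(sweep(as.matrix(modelSTAN[-resp]), 2, sds, `*`), 2, means, `+`)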

Using apply() function to iterate over different data types doesn't work

I want to write a function that dynamically uses different correlation methods depending on the scale of measure of the feature (continuous, dichotomous, ordinal). The label is always continuous. My idea was to use the apply() function to iterate over every feature (i.e., column), check its scale of measure (numeric, factor with two levels, factor with more than two levels), and then use the appropriate correlation function. Unfortunately, my code seems to convert every feature into a character vector, and as a consequence the condition in the if statement is always false for every column. I don't know why my code is doing this. How can I prevent my code from converting my features to character vectors?
set.seed(42)
foo <- sample(c("x", "y"), 200, replace = T, prob = c(0.7, 0.3))
bar <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.5,0.05,0.1,0.1,0.25))
y <- sample(c(1,2,3,4,5),200,replace = T,prob=c(0.25,0.1,0.1,0.05,0.5))
data <- data.frame(foo,bar,y)
features <- data[, !names(data) %in% 'y']
dyn.corr <- function(x, y){
  # print out the structure of every column
  print(str(x))
  # if the feature is numeric and has more than two outcomes, use corr.test
  if(is.numeric(x) & length(unique(x)) > 2){
    result <- corr.test(x, y)[['r']]
  } else {
    result <- "else"
  }
}
result <- apply(features,2,dyn.corr,y)
apply is built for matrices. When you apply over a data frame, the first thing that happens is that the data frame is coerced to a matrix. A matrix can hold only one data type, so all columns of your data are converted to the most general type among them.
Use sapply or lapply to work with columns of a data frame.
This should work fine (to test it yourself, corr.test comes from the psych package):
result <- sapply(features, dyn.corr, y)
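You can see the coercion directly by comparing the two:
# apply() coerces the data frame to a character matrix first
apply(features, 2, class)    # every column reports "character"
# sapply() passes each column as-is, so classes are preserved
sapply(features, class)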

Write script to ignore objects which can’t be found in r

I am trying to construct a script in R and force it to ignore objects it can't find.
A simplified version of my script is as follows
Trial <- sum(a, b, c, d, e)
a-e are numeric vectors generated by calculating the sum of a column in a data frame.
My problem is that I want to use the same script over multiple different conditions (and I have far more objects than just a-e). For some of these conditions, some of the objects a-e may not exist, so R returns the error: object 'd' not found.
To avoid having to generate a unique script for each condition, I would like to force R to ignore any missing objects.
I would be grateful for any help!
Welcome to SO! As mentioned in the comments, in the future try to include a working example in your question. The preferred solution to your problem would be to avoid assigning values to individual variables in the first place. Try to restructure your code so that your column sums get assigned to, for example, a list or vector. In the example below, I create some sample data, assign the column sums to a vector, and compute the sum of that vector, without creating a new variable for each column.
# Create sample data
rData <- as.data.frame(matrix(c(1:6), nrow=6, ncol=5, byrow = TRUE))
print(rData)
# Compute column sum
sumVec <- apply(rData, 2, sum)
print(sumVec)
# Compute sum of column sums
total <- sum(sumVec)
print(total)
If you have to use individual variables, before adding them up, you could check if the variable exists, and if not, create it and assign NA. You can then compute the sum of your variables after excluding NA.
# Sample variables
a <- 15
b <- 20
c <- 50
# Assign NA if it doesn't exist (one variable at a time)
if(!exists("d")) { d <- NA }
# Assign NA using sapply (preferred)
sapply(c("a", "b", "c", "d", "e"), function(x)
  if(!exists(x)) { assign(x, NA, envir = .GlobalEnv) }
)
# Compute sum after excluding NA
altTotal <- sum(na.omit(c(a,b,c,d,e)))
print(altTotal)
Hopefully this will get you closer to the solution!
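Another route, if you must keep the individual variables, is mget(), which can substitute NA for any names it can't find without creating them:
# Fetch whichever of a-e exist; missing ones come back as NA
vals <- mget(c("a", "b", "c", "d", "e"), envir = .GlobalEnv,
             ifnotfound = NA)
Trial <- sum(unlist(vals), na.rm = TRUE)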

NA in clustering functions (kmeans, pam, clara). How to associate clusters to original data?

I need to cluster some data and I tried kmeans, pam, and clara with R.
The problem is that my data are in a column of a data frame, and contains NAs.
I used na.omit() to get my clusters. But then how can I associate them with the original data? The functions return a vector of integers without the NAs and retain no information about the original positions.
Is there a clever way to associate the clusters to the original observations in the data frame? (or a way to intelligently perform clustering when NAs are present?)
Thanks
The output of kmeans corresponds to the elements of the object passed as argument x. In your case, you omit the NA elements, and so $cluster indicates the cluster that each element of na.omit(x) belongs to.
Here's a simple example:
d <- data.frame(x=runif(100), cluster=NA)
d$x[sample(100, 10)] <- NA
clus <- kmeans(na.omit(d$x), 5)
d$cluster[which(!is.na(d$x))] <- clus$cluster
And in the plot below, colour indicates the cluster that each point belongs to.
plot(d$x, bg=d$cluster, pch=21)
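A quick check that only the non-missing observations were assigned:
# Rows with NA x keep NA in the cluster column; the rest get a cluster id
table(d$cluster, useNA = 'ifany')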
This code works for me, starting with a matrix containing a whole row of NAs:
DF <- matrix(rnorm(100), ncol = 10)
row.names(DF) <- paste("r", 1:10, sep = "")
DF[3, ] <- NA
res <- kmeans(na.omit(DF), 3)$cluster
res
DF <- cbind(DF, clus = NA)
DF[names(res), "clus"] <- res
print(DF[, "clus"])
