SVD with missing values in R

I am performing an SVD analysis in R, but I have a matrix with structural NA values. Is it possible to obtain an SVD decomposition in this case? Are there alternative solutions? Thanks in advance.

You might want to try the SVDmiss function in the SpatioTemporal package, which imputes the missing values and then computes the SVD of the imputed matrix (see ?SVDmiss).
However, you might want to be wary of the nature of your data and whether missing value imputation makes sense in your case.
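A minimal sketch of what a call could look like, assuming SVDmiss() takes the data matrix as its first argument plus iteration/rank controls; check ?SVDmiss for the exact arguments and the components of the return value:

# minimal sketch; argument names and return components may differ by package version
library(SpatioTemporal)

X <- matrix(rnorm(20 * 5), nrow = 20)     # toy data
X[sample(length(X), 10)] <- NA            # introduce missing entries

fit <- SVDmiss(X, niter = 25, ncomp = 3)  # impute, then SVD the imputed matrix
str(fit)                                  # inspect the imputed matrix and its SVD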

I have tried using SVM in R with NA values without success.
The missing values themselves are sometimes important in the analysis, so I usually transform my data as follows:
If you have lots of variables, try to reduce their number first (clustering, lasso, etc.), then transform the remaining predictors like this:
- for quantitative variables:
  - calculate deciles per predictor (leaving missing obs out)
  - calculate the frequency of Y per decile (assuming Y is qualitative)
  - regroup deciles by the similarity of their Y frequencies into 2/3/4 groups (you can do this by looking at their plot too)
  - create a new binary variable for each group (X11 = 1 if X1 takes values in the interval ...)
  - calculate the Y frequency for the missing obs of that predictor
  - join the missing-obs category to the group (new binary variable) that has the closest Y frequency
- for qualitative variables:
  - if you have variables with lots of levels, you should cluster the levels by the Y variable
  - for variables with fewer levels, you can calculate the Y frequency per class
  - regroup the classes as above
  - calculate the same thing for the missing obs and attach them to the most similar group of non-missing levels
  - recode the variable as in the numeric case
There, now you have a complete database of dummy variables and the chance to perform SVM, neural networks, LASSO, etc. (a rough sketch of the quantitative-variable steps follows below).
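A rough sketch of the quantitative-variable steps above, assuming Y is a binary factor and X1 is one numeric predictor with missing values; every object name here is illustrative:

# illustrative sketch of the decile-binning recipe above; all names are made up
set.seed(1)
Y  <- factor(sample(c("yes", "no"), 500, replace = TRUE))
X1 <- c(rnorm(450), rep(NA, 50))

# deciles of X1, computed on the non-missing observations
brks <- quantile(X1, probs = seq(0, 1, 0.1), na.rm = TRUE)
dec  <- cut(X1, breaks = brks, include.lowest = TRUE)

# frequency of Y == "yes" per decile, and for the missing observations
freq_dec  <- tapply(Y == "yes", dec, mean)
freq_miss <- mean((Y == "yes")[is.na(X1)])

# regroup deciles with similar Y frequency (here: a crude split into 3 groups)
# and find the decile whose frequency is closest to that of the missing obs
grp     <- cut(freq_dec, 3)
closest <- names(freq_dec)[which.min(abs(freq_dec - freq_miss))]

# one dummy variable per group, e.g. for the first group:
in_grp1 <- names(freq_dec)[grp == levels(grp)[1]]
X1_g1   <- as.integer(dec %in% in_grp1)
X1_g1[is.na(X1)] <- as.integer(closest %in% in_grp1)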


PCoA function pcoa extract vectors; percentage of variance explained

I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they differ based on the percentage of variance, i.e., I want to find out a) whether there are any variables which help to draw certain observation points apart from one another and b) if yes, what percentage of variance is explained by them.
I was advised to run a PCoA (Principal Coordinates Analysis) on my data. I ran it using the vegan and ape packages. This is my code after loading my csv file into R (I call it data):
data.dis <- vegdist(data, method = "gower", na.rm = TRUE)
data.pcoa <- pcoa(data.dis)
I was then told to extract the vectors from the pcoa object, so:
data.pcoa$vectors
It then returned me 132 rows but 20 columns of values (e.g. from Axis 1 to Axis 20)
I was perplexed over why there were 20 columns of values when I only have 10 variables; I was under the impression that I would only get 10 columns. Could any kind souls out there help to explain a) what the vectors actually represent and b) how I get the percentage of variance explained by Axis 1 and 2?
Another question I had: I don't really understand the purpose of extracting the eigenvalues from data.pcoa. I saw some websites doing that after running a pcoa on their distance matrix, but there was no further explanation of it.
Gower index is non-Euclidean, and you can expect more real axes than the number of variables in a Euclidean ordination (PCoA). However, you said that your variables are categorical; I assume that in R lingo they are factors. If so, you should not use vegan::vegdist(), which only accepts numeric data. Moreover, if a variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error, so if you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for the Gower dissimilarity (there are many alternatives).
The percentage of "variance" is a bit tricky for non-Euclidean dissimilarities, which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of the data. ape::pcoa() returns the information you asked for in the element values: the proportion of variance explained is in values$Relative_eig, and the total "variance" is returned in the element trace. All this is documented in ?pcoa, where I read it.
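A short sketch of pulling those pieces out, assuming data.dis is a Gower dissimilarity computed with a package that accepts factors (e.g. cluster::daisy):

library(ape)
data.pcoa <- pcoa(data.dis)                # data.dis: dissimilarity computed elsewhere
data.pcoa$values$Relative_eig[1:2]         # proportion of "variance" for Axis 1 and 2
data.pcoa$trace                            # total "variance"
barplot(data.pcoa$values$Relative_eig)     # scree-style overview of all axes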

Estimate selection-unbiased allele frequencies with linear regression systems

I have a few data sets consisting of frequencies for i distinct alleles/SNPs of some populations. Additionally, I recorded some factors that are suspected of having changed the frequencies of these alleles within the populations in the past due to their selectional effect. It is assumed that the selection impact can be described in the form of a simple linear regression for every selection factor.
Now I'd like to estimate what the allele frequencies are expected to be under identical selectional forces (thus, I set selection=1). These new allele frequencies a'_i are derived as
a'_i = a_i - function[a_i|selection=1]
with a_i the current frequency of allele i in a population and function[a_i|selection=1] the estimated allele frequency under the absence of selectional forces.
However, there are some constraints for the whole process:
The minimal value allowed for a'_i is 0.
The sum of all allele frequencies a'_i has to be 1.
Usually I'd solve this problem by applying multiple linear regressions. But then the constraints are not fulfilled ...
Any idea how to approach this analysis with constraints (maybe using linear equation/regression systems or structural equation modelling)?
Here is an example data set containing allele frequencies for the ABO major allele groups (p, q, r) as well as the selection variables (x, y, z).
Although this example file only contains 3 alleles and 3 influential variables, all my data sets contain up to ~1050 alleles/SNPs and always 8 selection variables that may (but don't have to) have an impact on the allele frequencies ...
Many thanks in advance for ideas, code snippets and hints!
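Not a solution to the constrained problem, but a minimal sketch of the unconstrained approach described above (per-allele linear regressions, prediction at selection = 1, then a'_i = a_i - fitted value), which also shows how the two constraints end up violated; the column names p, q, r, x, y, z follow the example data set, and the data frame df is assumed:

# hedged sketch: df is assumed to hold allele frequencies (p, q, r) and selection variables (x, y, z)
alleles   <- c("p", "q", "r")
selection <- c("x", "y", "z")

a_prime <- sapply(alleles, function(al) {
  fit  <- lm(reformulate(selection, response = al), data = df)
  newd <- df[selection]
  newd[] <- 1                               # the "selection = 1" scenario
  df[[al]] - predict(fit, newdata = newd)   # a'_i = a_i - function[a_i | selection = 1]
})

rowSums(a_prime)   # generally not exactly 1, and
range(a_prime)     # values can drop below 0 -- the constraints the question asks about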

Survey weights and bootstrap weights to get counts and CIs

I have a file containing survey data. For example, the file looks like this:
IDNUMBER AGE SEX NumPrescr OnPrescr SURV_WGT BSW1 BSW2....BSW500
123456 22 1 6 1 ... ... ... ...
Here, OnPrescr is a binary variable indicating whether or not the subject is on prescription meds, BSW1 - BSW500 are the bootstrap weights, and SURV_WGT is the survey weight per subject. There are roughly 20000 entries.
I am tasked with creating tables of various statistics within certain age-gender group breakdowns. For example, how many males from 17 to 24 are on prescription medications. And I need a count N and 95% CI for each of these types of calculations. I'm not familiar at all with survey methods.
From what I understand, I can't simply add up the number of people in each category to get the final count N for each question/category (i.e., I cannot just add all the males 17 to 24 who are using prescription meds). Instead, I have to take into account the survey weights and bootstrap weights when constructing my final count N and confidence intervals.
I was then told in STATA this is a one line command:
svyset [pw=SURV_WGT], brr(bsw1-bsw500)
I am working in R however. What is the equivalent command in R and what exactly is the above command doing?
PS: My sample of roughly 20000 individuals is drawn from a population of roughly 35 million.
You will want to use the survey package in R. This will be your best friend for weighted/complex survey analysis in R.
install.packages("survey")
The survey package has two main steps to your analysis. The first is creating the svydesign object, which stores information about your survey design including weights, replicate weights, data, etc. Then use any number of analysis functions to run analysis/descriptives on those design objects (e.g., svymean, svyby - for subgroup analysis, svyglm, and many more).
Based on your question, you have survey weights and replicate weights (bootstrapped). While the more common svydesign function is used for surveys with a single set of weights, you want to use svrepdesign, which will allow you to specify survey weights and replicate weights. Check out the documentation, but here is what you can do:
mydesign <- svrepdesign(data = mydata,
                        weights = ~SURV_WGT,
                        repweights = "BSW[0-9]+",
                        type = "bootstrap",
                        combined.weights = TRUE)
You should read the documentation, but briefly: data is your data frame; weights takes your single survey-weight vector, usually as a formula; repweights is great in that it accepts a regex string that identifies all the replicate-weight columns in your data by column name; type tells the design how your replicate weights were derived; and combined.weights is a logical indicating whether the replicate weights contain the sampling weights - I assume this is true here, but it may not be.
From this design object, you can then run analysis. E.g., let's calculate the average number of prescriptions by sex:
myresult <- svyby(~NumPrescr,         # variable to pass to the function
                  by = ~SEX,          # grouping
                  design = mydesign,  # design object
                  vartype = "ci",     # report variation as a confidence interval
                  FUN = svymean)      # function from the survey package, the mean here
Hope this helps!
EDIT: if you want to look at something by age groups, as you suggest, you need to create a character or factor variable that codes each age group and use that new variable in your svyby call, for example:
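A minimal sketch of that, assuming mydata and mydesign from above; the age breaks and the use of svytotal for weighted counts are illustrative choices:

# illustrative sketch: age breaks and group labels are assumptions
mydata$AgeGroup <- cut(mydata$AGE,
                       breaks = c(16, 24, 44, 64, Inf),
                       labels = c("17-24", "25-44", "45-64", "65+"))

# rebuild the design object so it carries the new variable
mydesign <- svrepdesign(data = mydata, weights = ~SURV_WGT,
                        repweights = "BSW[0-9]+", type = "bootstrap",
                        combined.weights = TRUE)

# weighted counts of people on prescription meds by age group and sex, with 95% CIs
svyby(~OnPrescr, by = ~AgeGroup + SEX, design = mydesign,
      FUN = svytotal, vartype = "ci")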

Lasso, glmnet, preprocessing of the data

I'm trying to use the glmnet package to fit a lasso (L1 penalty) on a model with a binary outcome (a logit). My predictors are all binary (1/0, not ordered, ~4000 of them) except for one continuous variable.
I need to convert the predictors into a sparse matrix, since it takes forever and a day otherwise.
My question is: it seems that people use sparse.model.matrix rather than just converting their matrix into a sparse matrix. Why is that, and do I need to do it here? The outcome is slightly different for the two methods.
Also, do my factors need to be coded as factors (when it comes to both the outcome and the predictors) or is it sufficient to use the sparse matrix and specify in the glmnet model that the outcome is binomial?
Here's what I'm doing so far:
library(glmnet)
library(Matrix)

# Create a random dataset: y is the outcome, x_d holds the dummies (10 here for
# simplicity) and x_c is the continuous variable
y <- sample(c(1, 0), 200, replace = TRUE)
x_d <- matrix(data = sample(c(1, 0), 2000, replace = TRUE), nrow = 200, ncol = 10)
x_c <- sample(60:90, 200, replace = TRUE)

# FIRST: scale that one continuous variable,
scaled <- scale(x_c, center = TRUE, scale = TRUE)
# then bind the predictors together
x <- cbind(x_d, scaled)

# HERE'S MY MAIN QUESTION: what I currently do is
xt <- Matrix(x, sparse = TRUE)
# then run the cross-validation...
cv_lasso_1 <- cv.glmnet(xt, y, family = "binomial", standardize = FALSE)

# ...which gives slightly different results from (here the outcome variable is
# in the data frame too)
xt <- sparse.model.matrix(y ~ ., data = data.frame(y, x))
# then run the CV on xt.
So to sum up, my 2 questions are:
1- Do I need to use sparse.model.matrix even if my factors are just binary and not ordered? [And if yes, what does it actually do differently from just converting the matrix to a sparse matrix?]
2- Do I need to code the binary variables as factors?
The reason I ask is that my dataset is huge; it saves a lot of time to just do it without coding them as factors.
I don't think you need sparse.model.matrix: all it really gives you over a regular sparse matrix is the expansion of factor terms, and if your variables are binary already that won't gain you anything. You certainly don't need to code them as factors; I frequently use glmnet on a regular (non-model) sparse matrix with only 1s. At the end of the day glmnet is a numerical method, so a factor will get converted to a number in the end regardless.
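A minimal sketch of that direct route on made-up toy data:

# toy illustration: binary predictors go straight into a sparse matrix, no factor coding
library(glmnet)
library(Matrix)

X <- Matrix(matrix(rbinom(200 * 10, 1, 0.3), nrow = 200), sparse = TRUE)
y <- rbinom(200, 1, 0.5)

fit <- cv.glmnet(X, y, family = "binomial")   # works directly on the sparse matrix
coef(fit, s = "lambda.min")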

Cluster ordinal data

I want to cluster my data (kmeans or hclust) in R. My data is ordinal: a Likert scale measuring the causes of cost escalation (I have 41 causes, the "variables"), scored from 1 to 5, where 1 is no effect and 5 is a major effect (I have about 160 observations, the respondents who rank the causes). Any help on how to cluster the 41 causes based on the observations? Do I have to convert the scale to percentages or z-scores before clustering, or anything else that would help? I really need your help! Here is the data to play with: https://docs.google.com/spreadsheet/ccc?key=0AlrR2eXjV8nXdGtLdlYzVk01cE96Rzg2NzRpbEZjUFE&usp=sharing
I want to cluster the variables (the columns) in terms of similarity of occurrence in the observations. I followed the code at statmethods.net/advstats/cluster.html, but I couldn't cluster the variables that way; I also followed the work at mattpeeples.net/kmeans.html#help, but I don't know why he converts the data to percentages and then Z-score standardizes them.
It isn't clear to me if you want to cluster the rows (the observations) in terms of similarity in the variables, or cluster the variables (the columns) in terms of similarity of occurrence in observations?
Anyway, see package cluster. This is a recommended package that comes with all R installations.
Read ?daisy for details of what is done with ordinal data; the resulting dissimilarity can be used in functions such as agnes (for hierarchical clustering) or pam (for partitioning around medoids, a more robust version of k-means).
By default, these will cluster the rows/observations. Simply transpose the data object using t() if you want to cluster the columns (variables), although that may well mess up the data depending on how you have stored them. A short sketch:
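The sketch below assumes dat is the data frame of the 41 Likert-scale columns (scored 1 to 5); dat and the choice of k = 3 are illustrative:

# minimal sketch; object names and k are assumptions
library(cluster)

dat[] <- lapply(dat, ordered)        # tell daisy the scores are ordinal
d  <- daisy(dat, metric = "gower")   # dissimilarities between observations (rows)

hc <- agnes(d)                       # hierarchical clustering
pm <- pam(d, k = 3)                  # partitioning around medoids

# to cluster the 41 variables (columns) instead, transpose first,
# e.g. daisy(as.data.frame(t(dat)), metric = "gower"),
# bearing in mind the caveat above about t() changing how the data are stored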
Converting the data to percentages is a form of normalization, so that all the variables are on a comparable range.
If the data are not normalized, you run the risk of bias towards dimensions with large values.
