I am going to winsorize my dataset to remove some outliers using the package robustHD. This is the first time I have run into this error. The dataset contains 50+ variables and 100+ observations.
How can I fix this? And why does matrix singularity matter for a calculation like winsorization? Thanks.
df_win<-winsorize(df,prob=0.95)
Error in solve.default(R) : system is computationally singular: reciprocal condition number = 1.26103e-18
The reason for this is that winsorize in robustHD uses solve. If you look deeper into the code, winsorize on a data frame calls the winsorize.data.frame method, which simply runs as.matrix and then uses the winsorize.matrix method. That method in turn does a number of things, but the problem here is that it calls the solve function.
The error you get is from solve. It most likely occurs because you included variables/columns that are very highly correlated, or rather, that are linear combinations of each other. You may want to check whether you have duplicated variables or variables that are transformations of each other, as sketched below.
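A minimal sketch for spotting such columns, assuming the offending variables are numeric (the 0.999 cutoff is an arbitrary choice):
# keep only numeric columns, then look for near-perfect pairwise correlations
num <- df[, sapply(df, is.numeric)]
cm  <- cor(num, use = "pairwise.complete.obs")
which(abs(cm) > 0.999 & upper.tri(cm), arr.ind = TRUE)
# a rank check catches exact linear combinations of several columns
qr(as.matrix(num))$rank < ncol(num)  # TRUE means the matrix is rank-deficient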
There are several things you can do:
Remove one of the highly correlated variables and try again.
Look for a different package to take a winsorize function from.
Write your own winsorize function (see the sketch after the sos example below).
The quickest way to do the second option:
require(sos)
findFn("winsorize")
This will produce an overview of all functions that mention "winsorize" in their description. Just look for ones that are described as performing winsorization.
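For the third option, here is a minimal per-column sketch. It is not the robustHD implementation (which shrinks observations based on a robust scatter estimate, hence the solve call); it simply clips each numeric column at its own quantiles, so no matrix inversion is involved and singularity cannot bite:
winsorize_simple <- function(x, prob = 0.95) {
  lo <- (1 - prob) / 2               # e.g. 0.025 for prob = 0.95
  hi <- 1 - lo                       # e.g. 0.975
  clip <- function(v) {
    if (!is.numeric(v)) return(v)    # leave non-numeric columns alone
    q <- quantile(v, c(lo, hi), na.rm = TRUE)
    pmin(pmax(v, q[1]), q[2])        # clamp to the two quantiles
  }
  as.data.frame(lapply(x, clip))
}
df_win <- winsorize_simple(df, prob = 0.95)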
I am working with the 'indicspecies' package's multipatt function and am unable to extract the summary values. Unfortunately I can't print the whole summary and am left with partial information for my model. The reason is the huge amount of data that needs to be printed from the summary (300,000 different species, 3 groups, 6 comparable combinations).
This is what happens when the summary is saved (preceding code included):
x <- multipatt(data, ...)
sumx <-summary(x)
sumx
NULL
str(sumx)
NULL
So, the summary does not work exactly like a generic summary. It seems that the function is based on the older indval function from the 'labdsv' package (which is mentioned in the documentation). I found an archived thread where a similar problem is discussed: http://r.789695.n4.nabble.com/extract-values-from-summary-of-function-indval-of-the-package-labdsv-td4637466.html
but it seems unresolved (and it is about the base indval function rather than exactly the same one).
I was wondering if anyone has experience with the indicspecies package and knows a way to extract the info from the summary.
It is possible to extract significance and other information from the other data saved in the model, but it would be nice to just get a quick, complete overview of the data.
PS: I tried
options(max.print=1000000)
but this didn't solve it for me.
I used to capture the summary output for a multipatt object, but don't any more, because the p-values reported are not corrected for multiple testing. To answer the OP's question: you can capture the summary output using capture.output,
ex.
dat.multipatt.summary<-capture.output(summary(dat.multipatt, indvalcomp=TRUE))
Again, I do not recommend this. It is very important to correct the p-values for multiple testing, so the raw summary output actually isn't helpful. To be clear, ?multipatt states:
"sign Data table with results of the best matching pattern, the association value and the degree of statistical significance of the association (i.e. p-values from permutation test). Note that p-values are not corrected for multiple testing."
I just posted an answer for how to correct the p-values here https://stats.stackexchange.com/questions/370724/indiscpecies-multipatt-and-overcoming-multi-comparrisons/401277#401277
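Since the uncorrected p-values live in the sign slot quoted above, a minimal sketch of such a correction could look like the following (the FDR method and the multipatt call are assumptions about your setup):
library(indicspecies)
x <- multipatt(data, groups)     # hypothetical abundance data and grouping vector
# adjust the permutation p-values stored in the sign table
x$sign$p.value.adj <- p.adjust(x$sign$p.value, method = "fdr")
head(x$sign)                     # now includes an adjusted p-value column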
I don't have any experience with this package, and since you haven't provided the data it's difficult to reproduce. But since summary is returning NULL, are you sure your x is computed properly? Check the object.size or class or something else of x to see whether it has any content at all.
Also, instead of accessing all the contents of summary(x) together, you can use @ to access slots of it (similar to $ in a data frame).
If you need further assistance, it would be better to provide at least a small subset or some other sample data so that the community can work with it.
I have a dataset where I've fitted a linear model, and I've tried to use the step function on this linear model. I get an error message saying "number of rows in use has changed: remove missing values?".
I noticed that a few of the observations (not many) in my dataset had NA values for one variable. I've seen similar questions which suggest using na.omit(), but when I do this I lose the observations. I want to keep the observations however, because they contain useful information for the other variables. Is there a way to use step and avoid losing the observations?
You can call the nobs function to check that the number of observations is unchanged, and use its use.fallback argument to potentially guess the missing values. The R documentation, however, recommends omitting the relevant data before running step.
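A minimal sketch of that recommendation, dropping only the rows that are incomplete for the model variables (the names df, y, x1, x2 are hypothetical):
vars <- c("y", "x1", "x2")                    # variables used in the model
dfc  <- df[complete.cases(df[, vars]), ]      # listwise deletion on those only
fit  <- lm(y ~ x1 + x2, data = dfc)
fit2 <- step(fit)                             # row count is now stable
nobs(fit) == nobs(fit2)                       # should be TRUE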
I would discourage you from simply omitting the missing values if they are indeed truly missing. You can use multiple imputation via Amelia to impute the data so that you have a full dataset.
see here: https://cran.r-project.org/web/packages/Amelia/Amelia.pdf
Also, I would recommend reviewing the book "Statistical Analysis With Missing Data" by R. Little and D.B. Rubin.
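A minimal sketch of the Amelia route, assuming df is your data frame and y ~ . stands in for your actual model formula; m = 5 is a conventional default, not a recommendation:
library(Amelia)
a.out <- amelia(df, m = 5)        # produces five imputed datasets
# run the stepwise search on each completed dataset
fits <- lapply(a.out$imputations, function(d) step(lm(y ~ ., data = d)))
You would then pool results across the imputations rather than pick a single model; Little and Rubin cover how to combine the estimates.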
This might not be the right place to ask, but I'm not sure where else to ask it. I'm trying to use the smbinning package; in particular, I'm trying to bin by multiple predictor variables. The issue is that all the examples in the package documentation deal with only one predictor variable. I tried this naively:
result=smbinning(df=training,y="FlagGB",x=".,",p=.05)
which seemed to execute okay, but then if I tried to run result$ivtable I got the error
Error in result$ivtable : $ operator is invalid for atomic vectors
Does anyone know a) how to get smbinning to accept multiple predictors or if it can't another package that can; b) how to resolve the specific error listed above?
I have solved the problem. It may be because training is not a data frame; you have to convert it with as.data.frame(training). If you look at the smbinning code (https://github.com/cran/smbinning/blob/master/R/smbinning.R#L490), there is this block:
i=which(names(df)==y) # Find Column for dependant
j=which(names(df)==x) # Find Column for independant
if (!is.numeric(df[,i]))
{
return("Target (y) not found or it is not numeric")
}
Secondly, the y variable FlagGB must be numeric. If your y variable is a factor, convert it with as.numeric(as.character(y)) rather than as.numeric() directly.
The problem is similar to "Target (y) not found or it is not numeric" - Package smbinning - R.
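A minimal sketch of both fixes together (the predictor name "SomeVar" is hypothetical):
training <- as.data.frame(training)                          # ensure a data frame
training$FlagGB <- as.numeric(as.character(training$FlagGB)) # factor -> numeric
result <- smbinning(df = training, y = "FlagGB", x = "SomeVar", p = 0.05)
if (is.character(result)) print(result) else result$ivtable  # errors come back as strings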
Have you looked into the "Information" package? It seems to do the job, but there is no facility to recode the variable; or if there is one, I haven't been able to find it. Otherwise it is a really great package for exploration and analysis of variables.
To answer b): print result and (most probably) you will see that the function in fact did not execute, for the specific reason that you get in return.
Indeed, it is a bit confusing that the smbinning package returns its errors silently and within the variable itself.
Question a), on the other hand, is hard to answer without looking at the data. You can try to cross/multiply your variables, but that may result in a very large number of factor levels. I would suggest that you apply the smbinning package to group each of your characteristics into a few groups and then try to cross the groups, e.g. along the lines of the sketch below.
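A minimal sketch of binning each characteristic separately (predictor names are hypothetical):
predictors <- c("var1", "var2", "var3")      # your candidate characteristics
results <- lapply(predictors, function(v)
  smbinning(df = training, y = "FlagGB", x = v, p = 0.05))
names(results) <- predictors
# smbinning returns a plain string when it fails for a variable
ok <- !sapply(results, is.character)
results[ok][[1]]$ivtable                     # inspect the first successful one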
For question a), you can use smbinning.sumiv, which calculates IV for all variables in one step:
sumivt=smbinning.sumiv(chileancredit.train,y="FlagGB")
sumivt # Display table with IV by characteristic
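A possible hedged follow-up: filter on the IV column before binning individual variables (the 0.1 cutoff is a rule of thumb, and the column names are assumptions about the returned table):
keep <- sumivt$Char[!is.na(sumivt$IV) & sumivt$IV >= 0.1]
keep                                         # characteristics worth binning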
I have this script which does a simple PCA on a number of variables and at the end attaches two coordinates and two other columns (presence, NZ_Field) to the output file. I have done this many times before, but now it's giving me this error:
covariance matrix is not non-negative definite
I understand that it means there are negative eigenvalues. I looked at similar posts which suggest using na.omit, but it didn't work.
I have uploaded the "biodata.Rdata" file here:
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
I am pretty sure it is not because of missing values in the data, because I have used the same data with different "presence" and "NZ_Field" columns.
Any help is highly appreciated.
load("biodata.rdata")
#save data separately
coords=biodata[,1:2]
biovars=biodata[,3:21]
presence=biodata[,22]
NZ_Field=biodata[,23]
#Do PCA
bpc=princomp(biovars ,cor=TRUE)
#re-attach data with auxiliary data..coordinates, presence and NZ location data
PCresults=cbind(coords, bpc$scores[,1:3], presence, NZ_Field)
write.table(PCresults,file= "hlb_pca_all.txt", sep= ",",row.names=FALSE)
This does appear to be an issue with missing data, so there are a few ways to deal with it. One way is to manually do listwise deletion on the data before running the PCA, which in your case would be:
biovars<-biovars[complete.cases(biovars),]
The other option is to use another package; specifically, psych seems to work well here, and you can use principal(biovars). While the output is a bit different, it works using pairwise deletion. So basically it comes down to whether you want pairwise or listwise deletion. Thanks!
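A minimal sketch of the psych route; nfactors = 3 mirrors the three score columns used above, and rotate = "none" keeps the components comparable to unrotated princomp output:
library(psych)
pc <- principal(biovars, nfactors = 3, rotate = "none")  # pairwise deletion by default
PCresults <- cbind(coords, pc$scores, presence, NZ_Field)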
I am trying to use R to run plm on two large datasets, one with 400K obs and the other with 1.1 million. I can run the smaller one in SAS, but the larger doesn't work. I wanted to see if I could use R instead, but when I try to run the code below it always comes back as follows:
> pvlag<-read.csv(file="pvlag.csv", sep=",")
> pvpanel<-plm.data(pvlag, c("New_ID", "billmo"))
pv<-plm(usetotl~livgarea+yardarea+poolsize+lagavg+lat1+nonlat1+grad+grad,data=pvpanel, model="random", random.method=("swar"), index=c("New_ID", "billmo"))
series are constants and have been removed
Error in solve.default(crossprod(X.m)) : system is computationally singular:
  reciprocal condition number = 6.47315e-22
This happens with both datasets, even though when I run the smaller one in SAS it outputs estimated coefficients etc. without issue. Does anyone have any idea why this is happening? Also, since I am running a random effects model, why would constant values be removed? I thought that was only an issue with fixed effects models?
You used the variable grad twice. It also happens if you use dummy variables that together produce 1s over the whole sample: say you have two dummy variables, the first has a 1 for the first 200K observations and the second has a 1 for the second 200K. You can't use both; you have to choose one, but it does not matter which.
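A minimal sketch for checking this before calling plm (the formula is taken from the question, with the duplicated grad dropped):
form <- usetotl ~ livgarea + yardarea + poolsize + lagavg + lat1 + nonlat1 + grad
X <- model.matrix(form, data = pvlag)  # design matrix including the intercept
qr(X)$rank == ncol(X)                  # FALSE would signal remaining collinearity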
For me, I had fallen into the dummy variable trap when I got this error. Isn't that your case too?