Why do I keep getting "argument is of length zero" error? - r

I'm trying to run a lasso regression on my large dataset but I keep obtaining the following error messages:
**Error in if (is.null(np) | (np[2] <= 1)) stop("x should be a matrix with 2 or more columns") :
argument is of length zero**
**Error in elnet(x, is.sparse, ix, jx, y, weights, offset, type.gaussian, :
(list) object cannot be coerced to type 'double'**
My dataset contains information on a travel index (GTI) for determining 'safe' LGBT traveling. I'm trying to use the other variables in the dataset to fit a model that predicts the GTI.
Here is the code I have used thus far:
gaydata <- read.csv(file = 'GayData.csv')
(Image: sample of data and headers)
names(gaydata)[names(gaydata) == "Total"] <- "GTI"
lasso_1 = glmnet(GTI ~ Anti.Discrimination.Legislation + Marriage.Civil.Partnership + Adoption.Allowed +
Transgender.Rights + Intersex.3rd.Option + Equal.Age.of.Consent +
X.Conversion.Therapy + LGBT.Marketing + Religious.Influence +
HIV.Travel.Restrictions + Anti.Gay.Laws + Homosexuality.Illegal +
Pride.Banned + Locals.Hostile + Prosecution + Murders + Death.Sentences, data = gaydata)
OR
lasso_2 = glmnet(x=gaydata, y=gaydata$GTI, alpha=1)
Removing 'Country' since it is categorical data that may be causing an issue
gaydata = subset(gaydata, select = -Country)
Trying to identify what is causing "argument is of length zero" error
sapply(gaydata, is.null)
sapply(gaydata, is.factor)
sum(is.null(gaydata))
While researching a solution to this issue, I've seen that nulls, incorrect column names, and issues with factor variables typically cause the error. However, my data does not have those problems, so I'm lost. My data is a copy and paste from the

Just figured it out with the help of a statistician.
Apparently I needed to convert my dataset to a matrix:
gaydata = as.matrix(gaydata)
and use the following format
lasso_0 = glmnet(y=gaydata[,2], x=gaydata[,-2])
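A possible explanation for both errors, offered as a sketch rather than a definitive diagnosis: glmnet() takes a predictor matrix x and a response y, not a formula. In lasso_1 the formula ends up as x, and dim() of a formula is NULL, which makes the condition quoted in the first error message zero-length. In lasso_2 the data frame (a list) cannot be coerced to a numeric matrix. The code below reproduces the zero-length condition in base R and then builds a numeric predictor matrix. The data frame `toy` and its values are made up for illustration, and the glmnet call only runs if the package is installed.

```r
# Reproduce the zero-length condition from the first error message.
np <- dim(y ~ a + b)                 # a formula has no dim attribute, so this is NULL
cond <- is.null(np) | (np[2] <= 1)   # TRUE | logical(0) gives logical(0)
length(cond)                         # 0, so if (cond) fails with "argument is of length zero"

# Hypothetical stand-in for the real data (values are illustrative only).
set.seed(1)
toy <- data.frame(
  Country      = paste0("C", 1:20),
  GTI          = rnorm(20),
  Murders      = rnorm(20),
  Pride.Banned = rnorm(20)
)

# Keep only the numeric predictors and convert them to a numeric matrix.
x <- data.matrix(toy[, setdiff(names(toy), c("Country", "GTI"))])
y <- toy$GTI

# Fit the lasso only if glmnet is available (alpha = 1 selects the lasso penalty).
if (requireNamespace("glmnet", quietly = TRUE)) {
  lasso_fit <- glmnet::glmnet(x = x, y = y, alpha = 1)
}
```

Selecting the predictors by name instead of by column position avoids depending on where GTI happens to sit in the data frame.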

Related

Using mlogit.data() gives error: r error 4 (Error in `[.data.frame`(x, start:min(NROW(x), start + len)) : undefined columns selected )

I'm trying to change the format so that all the prices are under one column. End goal is to run a “mixed” multinomial logit model, hence the mlogit.data() function.
Everything runs fine until I try to open the data frame; that's when I get this error:
r error 4 (Error in [.data.frame(x, start:min(NROW(x), start + len)) : undefined columns selected ).
What can I do to get this working?
library(mlogit)
choice = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/choice.csv')
choice <- mlogit.data(choice, shape="wide", varying=5:9, choice="choice")
What the final regression will look like:
mixed <- mlogit(choice ~ 0 | log(income) + female + education + price, data = choice)

Error in seq.default(from = min(k), to = max(k), length = nBreaks + 1) : 'from' must be a finite number. WISH-R package

I have a list of pre-filtered genomic regions (based on previous GWAS and some enrichment analysis performed on GSEA) and I am looking for interesting gene-gene interactions.
I have a binary phenotype, so I have used glm=T in the model.
I have followed in detail the WISH-R guide - https://github.com/QSG-Group/WISH - and generated the correlations matrix without issues.
I am now struggling to use the generate.modules function, so I am writing here for some help.
I have tried several times to run generate.modules(correlations,values="Coefficients",thread=2).
Before that I also ran, as suggested:
correlations$Coefficients[(is.na(correlations$Coefficients))]<-0
correlations$Pvalues[(is.na(correlations$Pvalues))]<-1
This is my R code:
library(WISH)
library(data.table)
ped <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/all_snp_tf_recoded.ped", data.table=F)
tped <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/all_snp_tf_recoded.tped", data.table=F)
pval <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/ALL_SNP_TF_p.txt", data.table=F)
id <- fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/ALL_SNP_TF_id.txt", data.table=F)
genotype <-generate.genotype(ped,tped,snp.id=id, pvalue=0.005,id.select=NULL,gwas.p=pval,major.freq=0.95,fast.read=T)
LD_genotype<-LD_blocks(genotype)
genotype <- LD_genotype$genotype
pheno<-fread("D:/Dati/GWAS_ITALIAN_PBC_Mike_files/EPISTASI/epistasi_all SNPs_all_TF/file_epistasi_per_wish/pheno.txt",data.table=F)
pheno<-ifelse(pheno=="1","0","1")
pheno<-as.numeric(pheno)
correlations<-epistatic.correlation(pheno, genotype,threads = 2 ,test=F,glm=T)
genome.interaction(tped,correlations,quantile = 0.9)
correlations$Coefficients[(is.na(correlations$Coefficients))]<-0
correlations$Pvalues[(is.na(correlations$Pvalues))]<-1
generate.modules(correlations,values="Coefficients",thread=2)
I get the following error:
Error in seq.default(from = min(k), to = max(k), length = nBreaks + 1) :
'from' must be a finite number.
Do you have any hints for debugging this error?
What is the main issue here?

LME error in model.frame.default ... variable lengths differ

I am trying to run a random effects model with lme. It is part of a larger function, and I want it to be flexible so that I can pass the fixed (and ideally random) effects variable names to the lme function as variables. get() worked great for this when I started with lm, but with lme it only throws the ambiguous "Error in model.frame.default(formula = ~var1 + var2 + ID, data = list( : variable lengths differ (found for 'ID')". I'm stumped: the data are the same lengths, there are no NAs in this data or the real data, ...
set.seed(12345) #because I got scolded for not doing this previously
var1="x"
var2="y"
exdat<-data.frame(ID=c(rep("a",10),rep("b",10),rep("c",10)),
x = rnorm(30,100,1),
y = rnorm(30,100,2))
#exdat<-as.data.table(exdat) #because the data are actually in a dt, but that doesn't seem to be the issue
Works great
lm(log(get(var1))~log(get(var2)),data=exdat)
lme(log(y)~log(x),random=(~1|ID), data=exdat)
Does not work
lme(log(get(var1,pos=exdat))~log(get(var2)),random=(~1|ID), data=exdat)
Does not work, but throws a new error code: "Error in model.frame.default(formula = ~var1 + var2 + rfac + exdat, data = list( : invalid type (list) for variable 'exdat'"
rfac="ID"
lme(log(get(var1))~log(get(var2)),random=~1|get(rfac,pos=exdat), data=exdat)
Part of the problem seems to be with the nlme package. If you can consider using lme4, the desired results can be obtained with:
lme4::lmer(log(get(var1)) ~ log(get(var2)) + (1 | ID),
data = exdat)
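If staying with nlme is preferred, one alternative is to build the formulas from the strings before calling lme(), so that the formulas contain the real column names instead of get() calls. This is a minimal sketch, not a claim about how nlme should be used in general. It assumes lme() collects the variable names from its formulas when building the model frame, which would be consistent with the ~var1 + var2 + ID formula shown in the error message. The names `resp`, `pred`, `fixed_f`, and `random_f` are hypothetical, and the model is the log(y) ~ log(x) fit that the question reports as working.

```r
library(nlme)

set.seed(12345)
exdat <- data.frame(ID = rep(c("a", "b", "c"), each = 10),
                    x  = rnorm(30, 100, 1),
                    y  = rnorm(30, 100, 2))

# Variable names arrive as strings, as they would inside a wrapper function.
resp <- "y"    # hypothetical names for this sketch
pred <- "x"
rfac <- "ID"

# Paste the names into formulas so lme() sees the real column names.
fixed_f  <- as.formula(sprintf("log(%s) ~ log(%s)", resp, pred))
random_f <- as.formula(sprintf("~ 1 | %s", rfac))

fit <- lme(fixed = fixed_f, random = random_f, data = exdat)
fixef(fit)
```

Because the formulas name the columns directly, every variable is looked up in exdat rather than in the calling environment.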

R: NA/NaN/Inf in X error

I am trying to perform a negative binomial regression using R. When I execute the following command:
DV2.25112013.nb <- glm.nb(DV2.25112013~ Bcorp.Geographic.Proximity + Dirty.Industry +
Clean.Industry + Bcorp.Industry.Density + State + Dirty.Region +
Clean.Region + Bcorp.Geographic.Density + Founded.As.Bcorp + Centrality +
Bcorp.Industry.Density.Squared + Bcorp.Geographic.Density.Squared +
Regional.Institutionalization + Sales + Any.Best.In.Class +
Dirty.Region.Heterogeneity + Clean.Region.Heterogeneity +
Ind.Dirty.Heterogeneity+Ind.Clean.Heterogeneity + Industry,
data = analysis25112013DF6)
R gives the following error:
Error in glm.fitter(x = X, y = Y, w = w, etastart = eta, offset = offset, :
NA/NaN/Inf in 'x'
In addition: Warning message:
step size truncated due to divergence
I do not understand this error, since my data matrix does not contain any NA/NaN/Inf values. How can I fix this?
Thank you.
I think the most likely cause of this error is negative values or zeros in the data, since the default link in glm.nb is 'log'. It would be easy enough to test by changing to link="identity". I also think you need to try smaller models, maybe a quarter of those variables to start. That also lets you add related variables as bundles, since the names suggest a possibly severe potential for collinearity among the categorical variables.
We really need a data description. I wondered about Dirty.Industry + Clean.Industry. That is the sort of dichotomy that is better handled with a factor variable that has those levels. That prevents the collinearity if Clean = not-Dirty. Perhaps similarly with your "Heterogeneity" variables. (I'm not convinced that @BenBolker's comment is correct. I think it is very possible that you first need statistical consultation before addressing coding issues.)
require(MASS)
data(quine) # following example in ?glm.nb page
> quine$Days[1] <- -2
> quine.nb1 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine, link = "identity")
Error in eval(expr, envir, enclos) :
negative values not allowed for the 'Poisson' family
> quine$Days[1] <- 0
> quine.nb1 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine, link = "identity")
Error: no valid set of coefficients has been found: please supply starting values
In addition: Warning message:
In log(y/mu) : NaNs produced
I resolved this issue by passing a control argument to the model call with maxiter=10 or lower. The default is 50 iterations. Perhaps it works for you with a few more iterations; just try.
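The two suggestions above can be tried roughly as follows: inspect the design matrix and response for non-finite values, then cap the number of iterations. This is only a sketch on the quine data from the ?glm.nb help page, not on the asker's data. Note that glm.control() names its iteration argument maxit, and the value 10 simply follows the suggestion above.

```r
library(MASS)
data(quine)  # example data used on the ?glm.nb help page

# Build the design matrix the model would use and look for non-finite entries.
X <- model.matrix(Days ~ Sex/(Age + Eth*Lrn), data = quine)
bad_cols <- colnames(X)[colSums(!is.finite(X)) > 0]
bad_cols   # character(0) means no NA/NaN/Inf in the design matrix

# Check the response as well: counts must be non-negative and finite.
stopifnot(all(is.finite(quine$Days)), all(quine$Days >= 0))

# Cap the iterations through glm.control(); its argument is named maxit.
fit <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine,
              control = glm.control(maxit = 10))
fit$theta
```

If the design matrix is clean but the error persists, the non-finite values are probably produced during fitting (for example when the iterations diverge), which is where a smaller model or different starting values may help.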

RandomForest error code

I am trying to run a rather simple randomForest. I keep getting an error that does not make any sense to me. See the code below.
test.data<-data.frame(read.csv("test.RF.data.csv",header=T))
attach(test.data)
head(test.data)
Depth<-Data1
STemp<-Data2
FPT<-Sr_hr_15
Stage<-stage_feet
Q<-discharge_m3s
V<-vel_ms
Turbidity<-turb_ntu
Day_Night<-day_night
FPT.rf <- randomForest(FPT ~ Depth + STemp + Q + V + Stage + Turbidity + Day_Night, data = test.data,mytry=1,importance=TRUE,na.action=na.omit)
Error in randomForest.default(m, y, ...) : data (x) has 0 rows
In addition: Warning message:
In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I then check the dimensions to ensure there is in fact data recognized in R
dim(test.data)
[1] 77 15
This is a subset of the complete data set that I ran just to test whether I could get it to work, since I got the same error with the complete data set.
Why is it telling me data (x) has 0 rows when there clearly are rows?
Thanks
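One possible cause, offered as a guess since the data are not shown: with na.action=na.omit, any row that has an NA in a model variable is dropped, so if every row has at least one NA the forest is left with 0 rows even though dim(test.data) reports 77. The base R sketch below shows how that could be checked. The data frame `toy` and its values are hypothetical stand-ins for test.data. Separately, `mytry` in the call above looks like a typo for randomForest's `mtry` argument.

```r
# Hypothetical stand-in for test.data: one predictor column is entirely NA.
toy <- data.frame(
  FPT   = c(1.2, 3.4, 2.2, 5.1),
  Depth = c(0.5, 0.7, 0.6, 0.9),
  Q     = NA_real_
)

# NA count per column: a column whose count equals nrow() removes every row.
na_per_col <- colSums(is.na(toy))
na_per_col

# Rows that survive na.omit for the variables in the model.
n_complete <- sum(complete.cases(toy[, c("FPT", "Depth", "Q")]))
n_complete

# Dropping the all-NA column restores the usable rows.
n_without_q <- sum(complete.cases(toy[, c("FPT", "Depth")]))
n_without_q
```

Running the same two checks on the columns used in the formula would show whether missing values explain the 0 rows.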
