I'm working on a project for my Economics capstone with a very large data set. This is my first time ever programming, and I had to merge multiple data sets, 16 in total, with anywhere between 30,000 and 130,000 observations each. I did run into an issue merging the data sets, since some of them contained more columns than others, but I was able to address it using rbind.fill. Afterwards, I attempted to run a regression, but I encountered an error. The error was
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
Here is the original code for the regression:
ols_reg_mortcur1 <- lm(MORTCUR ~ EST_ST + WEEK + TBIRTH_YEAR + EGENDER + RHISPANIC +
RRACE + EEDUC + MS + THHLD_NUMPER + THHLD_NUMKID + THHLD_NUMADLT + WRKLOSS + ANYWORK +
KINDWORK + RSNNOWRK + UNEMPPAY + INCOME + TENURE + MORTCONF, data = set_up_weeks15st)
I googled the error for some possible solutions; I found suggestions like na.omit, na.exclude, etc., and tried them to no avail. This leads me to think I didn't implement them correctly, or that perhaps something went wrong with the merge itself. While I was cleaning the data, I set unknown or missing values, listed as -88 or -99 in the data sets, to NA, since I had to create a summary stats table. I'll attach my R doc. I apologize for the length of the attached code below; I wasn't sure whether to attach just the sections leading up to the regression or include other lines as well.
Based on the error message 0 (non-NA) cases, the likely reason is that you have at least one NA in every row. (This is easy to check with na.omit(set_up_weeks15st), which should return zero rows.)
In this case, setting na.action to na.omit or na.exclude is not going to help.
Try to find the columns with the most NAs and remove them, or impute the missing values using an appropriate method.
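A minimal sketch of both checks, assuming your merged data frame is named set_up_weeks15st as in your code:

# Count missing values per column; columns with counts near nrow() are the likely culprits
sort(colSums(is.na(set_up_weeks15st)), decreasing = TRUE)

# Confirm the diagnosis: if every row contains at least one NA, this prints 0
nrow(na.omit(set_up_weeks15st))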
I am trying to perform multiple logistic regression with some of the variables that came out as statistically significant for a disease condition in univariate analysis. We took the cutoff for that as p < 0.2, since our sample size was ~300. I made a new dataframe for these variables:
regression1df <- data.frame(dgfcriteria, recipientage, ESRD_dx, bmirange, graftnumber, dsa_class_1, organ_tx, transfuse01m, transfuse1yr, readmission1yr, citrange1, switrange, anastamosisrange, donorage, donorgender, donorcriteria, donorionotrope, intubaterange, kdpirange, kdrirange, eptsrange, proteinuria, terminalurea, na.rm=TRUE)
I'm using these variables to predict the disease condition, which is DGF (dgfcriteria == 1); non-disease is no DGF (dgfcriteria == 0).
Here is the structure of the data.
When I tried to run the entire list of variables through glm, I got:
predictors1 <- glm(dgfcriteria ~.,
data = predictors1df,
family = "binomial" )
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
But when I run it with only some of the variables of the dataframe, there is an output.
predictors1 <- glm(dgfcriteria ~ recipientage + ESRD_dx + bmirange + graftnumber + dsa_class_1 + organ_tx + transfuse01m + transfuse1yr + readmission1yr + citrange1 + switrange + anastamosisrange + donorage + donorgender + donorcriteria + donorionotrope,
data = predictors1df,
family = "binomial" )
This output looks really strange, though, with a lot of NAs.
Where have I gone wrong?
Looking at your data structure, you've got a lot of missing values. Quite a few of your variables look to have only 2 or 3 non-missing values in the first 10 rows. When you run regression on data with missing values, the default is to drop all rows that have any missing values.
Apparently the non-missing parts of your variables overlap badly, so that when all the rows with missing values are dropped (see na.omit(your_data) for what is left over), some variables have only one level left and are therefore no longer fit for regression. Of course, when you use only some of the variables, fewer rows are dropped, and you may be in a better situation.
So, you'll have to decide what to do with your missing values. This should depend on your goals and your understanding of the reasons for missingness. Common possibilities include omission, imputation, creating new "missing" levels, and taking level of missingness into account in your variable selection.
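A quick way to see which variables collapse after listwise deletion (a sketch, assuming the data frame passed to glm is predictors1df, as in your call):

# Keep only complete rows, exactly as glm's default na.action does
complete <- na.omit(predictors1df)
nrow(complete)  # how many rows survive listwise deletion

# Count distinct values per variable; anything at 1 triggers the contrasts error
sapply(complete, function(x) length(unique(x)))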
I'm trying to carry out covariate balancing using the ebal package. The basic code is:
W1 <- weightit(Conformidad ~ SexoCon + DurPetFiscPrisión1 +
Edad + HojaHistPen + NacionCon + AnteVivos +
TipoAbog + Reincidencia + Habitualidad + Delitos,
data = Suspension1,
method = "ebal", estimand = "ATT")
I then want to check the balance using the summary function:
summary(W1)
This originally worked fine but now I get the error message:
Error in rep(" ", spaces1) : invalid 'times' argument
It's the same dataset and the same code, except I changed some of the covariates. But now, even when I go back to the original covariates, I get the same error. Any ideas would be much appreciated!
I'm the author of WeightIt. That looks like a bug. I'll take a look at it. Are you using the most updated version of WeightIt?
Also, summary() doesn't assess balance. To do that, you need to use cobalt::bal.tab(). summary() summarizes the distribution of the weights, which is less critical than examining balance. bal.tab() displays the effective sample size as well, which is probably the most important statistic produced by summary().
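For reference, a minimal sketch of both calls, assuming the W1 object from the question:

library(cobalt)
bal.tab(W1)   # covariate balance in the weighted sample, plus effective sample sizes
summary(W1)   # distribution of the weights themselves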
I encountered the same error message. It happens when the treatment variable is coded as a factor or character rather than as numeric in weightit. To make summary() work, you need to code the treatment as 0 and 1.
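For example, a sketch of the recode, assuming Conformidad is a factor whose levels are the strings "0" and "1" (adjust if your levels differ):

# Convert a factor treatment to numeric 0/1 before calling weightit()
Suspension1$Conformidad <- as.numeric(as.character(Suspension1$Conformidad))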
I am a complete beginner at R and don't have much time to complete this analysis.
I need to run propensity score matching. I am using RStudio and have
Uploaded my dataset, which is called 'R' and was saved on my desktop
Installed and loaded the MatchIt package
My dataset has the following headings:
BA (my grouping variable: someone is either on BA or not, 0 = off, 1 = on),
then age, sex, timesincediagnosis, TVS, and tscore, which are my matching variables.
I have adapted the following code, which I found online:
m.nn <- matchit(ba ~ age + sex + timesincediagnosis + TVS + tscore,
data = R, method= " nearest", ratio = 1)
summary(m.nn)
I am getting the following errors:
Error in summary(m.nn) : object 'm.nn' not found
Error in matchit(ba ~ age + sex + timesincediagnosis + TVS + tscore,
data = R, : nearestnot supported.
I would really appreciate any help with why I am getting these errors or how I can change my code.
Thank you!
Credit goes to @MrFlick for noticing this, but the problem is that " nearest" is not an acceptable value to pass to method. What you want is "nearest" (without the leading space in the string). (Note that the default method is nearest-neighbor matching, so you can omit the method argument entirely if this is what you want to do.)
The first error printed (Error in summary(m.nn) : object 'm.nn' not found) occurs because R never created the m.nn object, owing to the other error.
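For reference, the corrected call, using the same variables as in the question:

library(MatchIt)
m.nn <- matchit(ba ~ age + sex + timesincediagnosis + TVS + tscore,
                data = R, method = "nearest", ratio = 1)
summary(m.nn)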
I'm trying to use esttab to output regression results in R. However, every time I run it I get an error:
Error in FUN(X[[i]], ...) : variable names are limited to 10000 bytes
Any ideas how to solve it? My code is below:
reg <- lm(y ~ ln_gdp + diffxln_gdp + diff + year, data=df)
eststo(reg)
esttab(store=reg)
The input data comes from approximately 25,000 observations, all coded as numeric. I can share more information if it is deemed relevant, but I don't know what that would be right now.
Thanks!
I'm new to data analysis, and I have a couple questions about using lm() in R to create a linear regression model of my data.
My data looks like this:
testID  userID  timeSpentStudying  testGrade
12345   007     10                 90
09876   008     0                  75
And my model:
model <- lm(formula = data$testGrade ~ timeSpentStudying, data = data)
I'm getting the following warning (twice) from RStudio, across just under 60 rows of data:
Warning messages:
1: In sqrt(crit * p * (1 - hh)/hh) : NaNs produced
2: In sqrt(crit * p * (1 - hh)/hh) : NaNs produced
My question is: does the problem have to do with the data containing many instances of zero, such as in the timeSpentStudying column above? If so, how do I handle that? Shouldn't lm() be able to handle values of zero, especially if those zeros are meaningful to the data itself?
Thanks!
So far I have been unable to replicate this, e.g.:
dd <- data.frame(y=rnorm(1000),x=c(rep(0,990),1:10))
model <- lm(y~x, data = dd)
summary(model)
Searching the R code base for the code listed in your error and tracing back indicates that the relevant lines are in plot.lm, the function that plots regression diagnostics. The problem is that you are somehow getting a leverage ("hat value") greater than 1 for one of your data points, but I can't see how you could be achieving that. Data would make this much clearer!
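If you can't share the data, one check worth running is on the hat values themselves (a sketch, using the model object from your question):

# sqrt(crit * p * (1 - hh)/hh) in plot.lm goes NaN when hh > 1, which should be
# impossible for an ordinary lm fit; check whether any leverages are that extreme
hh <- hatvalues(model)
range(hh)
which(hh >= 1)  # identify any offending observations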