Error in dataframe *tmp* replacement has x data has y - r

I'm a beginner in R. Here is some very simple code in which I'm trying to save the residual term:
# Create variables for child's EA:
dat$cldeacdi <- rowMeans(dat[,c('cdcresp', 'cdcinv')],na.rm=T)
dat$cldeacu <- rowMeans(dat[,c('cucresp', 'cucinv')],na.rm=T)
# Create a residual score for child EA:
dat$cldearesid <- resid(lm(cldeacu ~ cldeacdi, data = dat))
I'm getting the following message:
Error in `$<-.data.frame`(`*tmp*`, cldearesid, value = c(-0.18608488908881, :
replacement has 366 rows, data has 367
I searched for this error but couldn't find anything that could resolve this. Additionally, I've created the exact same code for mom's EA, and it saved the residual just fine, with no errors. I'd be grateful if someone could help me resolve this.

I have a feeling you have NAs in your data. Look at this example:
#mtcars data set
test <- mtcars
#adding just one NA in the cyl column
test[2, 2] <- NA
#running linear model and adding the residuals to the data.frame
test$residuals <- resid(lm(mpg ~ cyl, test))
Error in `$<-.data.frame`(`*tmp*`, "residuals", value = c(0.382245430809409, :
replacement has 31 rows, data has 32
As you can see this results in a similar error to yours.
As a validation:
length(resid(lm(mpg ~ cyl, test)))
#31
nrow(test)
#32
This happens because lm, by default, runs na.omit on the data before fitting the regression, so any row with an NA in a model variable is dropped and you get fewer residuals than rows.
If you run na.omit on your dat data set (i.e. dat <- na.omit(dat)) at the very beginning of your code, then your code should work. Note that this removes every row that has an NA in any column, not only in the model variables.
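If you would rather keep every row, one option is to have lm pad the residuals instead of shortening them. A minimal sketch, reusing the mtcars example above; na.exclude is a standard na.action in base R, and the row containing the NA simply gets an NA residual:

```r
# Same toy data as above: a single NA in the cyl column
test <- mtcars
test[2, 2] <- NA

# na.exclude still leaves the incomplete row out of the fit,
# but resid() pads the result back to full length with NA
fit <- lm(mpg ~ cyl, data = test, na.action = na.exclude)
test$residuals <- resid(fit)

# The residual vector now has one entry per row of test
length(resid(fit)) == nrow(test)
```

The same idea should carry over to the original code by adding na.action = na.exclude to the lm call, assuming the NAs are in cldeacu or cldeacdi.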

This is an old thread, but maybe this can help someone else facing the same issue. To LyzandeR's point, check for NAs as a first line of defense. In addition, make sure that you don't have any factors in x, as this can also cause the error.

Related

Error with RandomForest in R because of "too many categories"

I'm trying to train a RF model in R, but when I try to define the model:
rf <- randomForest(labs ~ .,data=as.matrix(dd.train))
It gives me the error:
Error in randomForest.default(m, y, ...) :
Can not handle categorical predictors with more than 53 categories.
Any idea what it could be?
And no, before you say "You have some categorical variable with more than 53 categories": all variables except labs are numeric.
Tim Biegeleisen: Read the last line of my question and you will see why it is not the same as the one you are linking!
Edited to address followup from OP
I believe using as.matrix in this case implicitly creates factors. It is also not necessary for this package. You can keep the data as a data frame, but you will need to make sure that any unused factor levels are dropped by using droplevels (or something similar). There are many reasons an unused factor level may be in your data set, but a common one is a dropped observation.
Below is a quick example that reproduces your error:
library('randomForest')
#making a toy data frame
x <- data.frame('one' = c(1,1,1,1,1,seq(50) ),
'two' = c(seq(54),NA),
'three' = seq(55),
'four' = seq(55) )
x$one <- as.factor(x$one)
x <- na.omit(x) #getting rid of an NA. Note this removes the whole row.
randomForest(one ~., data = as.matrix(x)) #your first error
randomForest(one ~., data = x) #your second error
x <- droplevels(x)
randomForest(one ~., data = x) #OK
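To check whether unused levels are the culprit before calling randomForest, you can count them directly. A small sketch in base R only, on a cut-down version of the toy data frame above (the column names are just the ones from that example):

```r
# Factor with levels 1..50; the only row holding level "50" also has an NA
x <- data.frame(one = factor(c(1, 1, 1, 1, 1, seq(50))),
                two = c(seq(54), NA))
x <- na.omit(x)                      # drops the row, but not the level

levels_before <- nlevels(x$one)      # level "50" is still declared
unused <- sum(table(x$one) == 0)     # number of levels with no observations

x <- droplevels(x)
levels_after <- nlevels(x$one)       # the empty level is gone
```

If unused is greater than zero on your own data, droplevels is worth trying before anything else.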

SVM Prediction is dropping values

I'm running an SVM model on a dataset, and it runs fine on the training data. However, when I run it on the prediction/test data, it seems to be dropping rows for some reason: when I try to add 'pred_SVM' back into the dataset, the lengths are different.
Below is my code
#SVM MODEL
SVM_swim <- svm(racetime_mins ~ event_date+ event_month +year
+event_id +
gender + place + distance+ New_Condition+
raceNo_Updated +
handicap_mins +points+
Wind_Speed_knots+
Air_Temp_Celsius +Water_Temp_Celsius +Wave_Height_m,
data = SVMTrain, kernel='linear')
summary(SVM_swim)
#Predict Race_Time Using Test Data
pred_SVM <- predict(SVM_swim, SVMTest, type ="response")
View(pred_SVM)
#Add predicted Race_Times back into the test dataset.
SVMTest$Pred_RaceTimes<- pred_SVM
View(SVMTest) #Returns 13214 rows
View(pred_SVM) #Returns 12830
Error in `$<-.data.frame`(`*tmp*`, Pred_RaceTime, value = c(`2` = 27.1766438249356, :
replacement has 12830 rows, data has 13214
As mentioned in the comments, you need to get rid of the NA values in your dataset. predict is dropping the rows that contain NA for you, so the pred_SVM output is calculated without them and ends up shorter than SVMTest.
To test whether there are NAs in your data, run sum(is.na(SVMTest)).
I am pretty sure that you will see a number greater than zero.
Before building your SVM model, get rid of all NA values with
dataset <- dataset[complete.cases(dataset), ]
Then, after splitting your data into Train and Test sets, you can run
SVM_swim <- svm(.....,data = SVMTrain, kernel='linear')
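If you need to keep the incomplete rows in the test set, the predictions can instead be written back only to the rows that were actually scored. A rough sketch on made-up data, using lm as a stand-in for svm so it runs without extra packages; train_df, test_df and ok are hypothetical names, not objects from the question:

```r
# Made-up training and test data; the test set has one incomplete row
train_df <- data.frame(x = 1:10, y = 2 * (1:10) + c(0.1, -0.1))
test_df <- data.frame(x = c(1.5, NA, 3.5, 4.5))

# Remember which test rows are complete before predicting
ok <- complete.cases(test_df)

fit <- lm(y ~ x, data = train_df)    # stand-in for the svm() call
preds <- predict(fit, newdata = test_df[ok, , drop = FALSE])

# Fill a full-length column with NA, then assign only the scored rows
test_df$pred <- NA_real_
test_df$pred[ok] <- preds
```

This keeps the row count of the test set unchanged and leaves NA where no prediction could be made.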

How to fix 'differing rows' error in predict function?

I am trying to use the predict function, but the output does not have the number of rows I expect. After reading about similar errors I assume something is wrong with my data.frame, but I can't figure out what.
I've tried to make sure my newdata has the same variable name as my model, but that doesn't fix it. The row counts differ because the training and test sets differ in size: for example, I train on 50 different sets of information and test on 39950 sets.
In both the train_data and the test_data there are 10 columns which are the samples that will be included in each calculation. The model correctly finds these and names them test_data1, test_data2, etc.
I'm sure there is something I'm missing but I can't seem to figure it out.
trainingSampleSize <- k
sample_sample[[k-1]] <- sample(1:ncol(pre$train_data), k, replace = FALSE)
train_data <- pre$train_data[,sample_sample[[k-1]]]
test_data <- pre$test_data[,sample_sample[[k-1]]]
data_lm <- data.frame(train_data, pre$train_targets)
cvFitList[[(k-1)]] <- lm(pre$train_targets ~ train_data, data_lm)
prediction[[k-1]] <- predict(cvFitList[[(k-1)]], data.frame(train_data=test_data))
My goal is to get a prediction for every set of test_data, 39950 results from predict.
I got a warning message:
'newdata' had 39950 rows but variables found have 50 rows
and prediction[[k-1]] has only 50 rows
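One likely cause, assuming train_data and test_data are matrices: the formula refers to the object train_data itself rather than to columns of data_lm, and data.frame(train_data=test_data) produces columns with prefixed names instead of a single variable called train_data. predict then falls back to the 50-row train_data in the workspace. A minimal sketch of the usual workaround on simulated data (all names and sizes here are made up), where the formula only uses column names that exist in both data frames:

```r
set.seed(1)
cols <- c("s1", "s2", "s3")

# Simulated stand-ins for the sampled training and test columns
train_x <- matrix(rnorm(50 * 3), ncol = 3, dimnames = list(NULL, cols))
train_y <- rnorm(50)
test_x <- matrix(rnorm(200 * 3), ncol = 3, dimnames = list(NULL, cols))

# Put predictors and target in one data frame and let the formula use its columns
train_df <- data.frame(train_x, target = train_y)
fit <- lm(target ~ ., data = train_df)

# newdata has the same column names, so predict uses it row by row
pred <- predict(fit, newdata = data.frame(test_x))
length(pred) == nrow(test_x)
```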

How to get Cox p-value for each gene?

If you run the following code, you will have a data frame real.dat which has 1063 samples for 20531 genes. There are 2 extra columns named time and event where time is the survival time and event is death in case of 1 and 0 in case of censored.
lung.dat <- read.table("genomicMatrix_lung")
lung.clin.dat <- read.delim("clinical_data_lung")
# For clinical data, get only rows which do not have NA in column "X_EVENT"
lung.no.na.dat <- lung.clin.dat[!is.na(lung.clin.dat$X_EVENT), ]
# Getting the transpose of main lung cancer data
ge <- t(lung.dat)
# Getting a vector of all the id's in the clinical data frame without any 'NA' values
keep <- lung.no.na.dat$sampleID
# getting only the samples(persons) for which we have a value rather than 'NA' values
real.dat <- ge[ge[, 1] %in% keep, ]
# adding the 2 columns from clinical data to gene expression data
keep_again <- real.dat[, 1]
temp_df <- lung.no.na.dat[lung.no.na.dat$sampleID %in% keep_again, ]
# naming the columns into our gene expression data
col_names <- ge[1, ]
colnames(real.dat) <- col_names
dd <- temp_df[, c('X_TIME_TO_EVENT', 'X_EVENT')]
real.dat <- cbind(real.dat, dd)
# renaming the 2 new added columns
colnames(real.dat)[colnames(real.dat) == 'X_TIME_TO_EVENT'] <- 'time'
colnames(real.dat)[colnames(real.dat) == 'X_EVENT'] <- 'event'
I want to get the univariate Cox regression p-value for each gene in the above data frame. How can I get this?
You can download the data from here.
Edit: Sorry for not clarifying enough. I have already tried to get it with the coxph function from the survival library. But even for one gene, it shows the following error -
> coxph(Surv(time, event) ~ HIF3A, real.dat)
Error in fitter(X, Y, strats, offset, init, control, weights = weights, :
NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Ran out of iterations and did not converge
That is why I did not provide a smaller reproducible example.
Are you really going to run a univariate regression for each of the 20531 genes?
Guessing wildly at the structure of your data (so creating a dummy set, based on the examples in help), and guessing what you're trying to do with the following toy example.....
library("survival")
?coxph ## to see the examples
## create dummy data
test <- list(time=c(4,3,1,1,2,2,3),
event=c(1,1,1,0,1,1,0),
gene1=c(0,2,1,1,1,0,0),
gene2=c(0,0,0,0,1,1,1))
## Cox PH regression
coxph(Surv(time, event) ~ gene1, test)
coxph(Surv(time, event) ~ gene2, test)
You may wish to use the following to get CIs and more information.
summary(coxph(...))
Hopefully that code is reproducible enough to help you clarify the question.
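If per-gene p-values are what you are after, the two calls above can be wrapped in a loop. A sketch on the same dummy data, assuming the gene columns are numeric and have syntactic names; genes and pvals are hypothetical names, and the Wald p-value is read from the coefficient table of summary(coxph(...)):

```r
library(survival)

# Same dummy data as above, as a data frame
test <- data.frame(time  = c(4, 3, 1, 1, 2, 2, 3),
                   event = c(1, 1, 1, 0, 1, 1, 0),
                   gene1 = c(0, 2, 1, 1, 1, 0, 0),
                   gene2 = c(0, 0, 0, 0, 1, 1, 1))

genes <- c("gene1", "gene2")

# Fit one univariate Cox model per gene and keep its Wald p-value
pvals <- sapply(genes, function(g) {
  fit <- coxph(as.formula(paste("Surv(time, event) ~", g)), data = test)
  summary(fit)$coefficients[1, "Pr(>|z|)"]
})
pvals
```

On the real data, genes would be every column name except time and event, and with roughly 20000 tests the p-values would normally be adjusted afterwards, for example with p.adjust.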

Removing character level outlier in R

I have a linear model1<-lm(divorce_rate~marriage_rate+median_age+population) for which the leverage plot shows an outlier at 28 (State variable id for "Nevada"). I'd like to specify a model without Nevada in the dataset. I tried the following but got stuck.
data<-read.dta("census.dta")
attach(data)
data1<-data.frame(pop,divorce,marriage,popurban,medage,divrate,marrate)
attach(data1)
model1<-lm(divrate~marrate+medage+pop,data=data1)
summary(model1)
layout(matrix(1:4,2,2))
plot(model1)
dfbetaPlots(lm(divrate~marrate+medage+pop),id.n=50)
vif(model1)
dataNV<-data[!data$state == "Nevada",]
attach(dataNV)
model3<-lm(divrate~marrate+medage+pop,data=dataNV)
The last line of the above code gives me
Error in model.frame.default(formula = divrate ~ marrate + medage + pop, :
variable lengths differ (found for 'medage')
I suspect that you have some glitch in your code such that you have attach()ed copies that are still lying around in your environment -- that's why it's really best practice not to use attach(). The following code works for me:
library(foreign)
## best not to call data 'data'
mydata <- read.dta("http://www.stata-press.com/data/r8/census.dta")
I didn't find divrate or marrate in the data set: I'm going to speculate that you want the per capita rates:
## best practice to use a new name rather than transforming 'in place'
mydata2 <- transform(mydata,marrate=marriage/pop,divrate=divorce/pop)
model1 <- lm(divrate~marrate+medage+pop,data=mydata2)
library(car)
plot(model1)
dfbetaPlots(model1)
This works fine for me in a clean session:
dataNV <- subset(mydata2,state != "Nevada")
## update() may be nice to avoid repeating details of the
## model specification (not really necessary in this case)
model3 <- update(model1,data=dataNV)
Or you can use the subset argument:
model4 <- update(model1,subset=(state != "Nevada"))
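To see why stray copies in the workspace produce this particular error, here is a small sketch on mtcars (not the census data); rate is a hypothetical free-standing vector playing the role of an attach()ed column:

```r
full <- mtcars
rate <- full$mpg / full$wt           # lives in the workspace, one value per row of full
sub <- subset(full, cyl != 8)        # fewer rows, and no 'rate' column

# 'rate' is not a column of sub, so lm picks up the longer workspace copy
res <- try(lm(rate ~ hp, data = sub), silent = TRUE)
inherits(res, "try-error")           # TRUE: variable lengths differ

# Keeping the variable inside the data frame removes the mismatch
sub2 <- transform(sub, rate = mpg / wt)
fit <- lm(rate ~ hp, data = sub2)
```

Variables named in the formula are looked up in data first and in the calling environment second, which is how a full-length copy can end up mixed with subsetted columns.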

Resources