Unexpected NA output from randomForest in R

I'm working with a data set that has a lot of NAs. I know that the first six columns do NOT have any NAs. Since the first column is an ID column, I'm omitting it.
I run the following code to select only lines that have values in the response column:
sub1 <- TrainingData[which(!is.na(TrainingData[,70])),]
I then use sub1 as the data set in a randomForest using this code:
set.seed(448)
RF <- randomForest(sub1[, 2:6], sub1[, 70],
                   do.trace = TRUE, importance = TRUE, ntree = 10, keep.forest = TRUE)
Then I run this code to check the output for NAs:
> length(which(is.na(RF$predicted)))
[1] 65
I can't figure out why I'd be getting NA's if the data going in is clean.
Any suggestions?

I think you should use more trees. The predicted values are the predictions for the out-of-bag cases, and if the number of trees is very small, some cases never appear in an out-of-bag set, because that set is formed randomly.
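A quick back-of-the-envelope check of this, in base R only (the n = 1000 sample size is just an illustrative assumption):

```r
# Each tree draws a bootstrap sample of size n with replacement, so the
# probability that a given case is IN-bag for one tree is:
p_inbag <- function(n) 1 - (1 - 1/n)^n        # ~0.632 for large n

# A case gets an NA in RF$predicted only if it is in-bag in every tree,
# i.e. it never lands in any out-of-bag set:
p_never_oob <- function(n, ntree) p_inbag(n)^ntree

p_never_oob(1000, 10)    # ~0.01: with only 10 trees, roughly 1% of cases get NA
p_never_oob(1000, 500)   # effectively zero with the default 500 trees
```

So raising ntree (randomForest's default is 500) should make the NAs disappear.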

Related

T test using column variable from 2 different data frames in R

I am attempting to conduct a t test in R to try and determine whether there is a statistically significant difference in salary between US and foreign born workers in the Western US. I have 2 different data frames for the two groups based on nativity, and want to compare the column variable I have on salary titled "adj_SALARY". For simplicity, say that there are 3 observations in the US_Born_west frame, and 5 in the Immigrant_West data frame.
US_born_West$adj_SALARY <- c(30000, 25000, 22000)
Immigrant_West$adj_SALARY <- c(14000, 20000, 12000, 16000, 15000)
#Here is what I attempted to run:
t.test(US_born_West$adj_SALARY~Immigrant_West$adj_SALARY, alternative="greater",conf.level = .95)
However I received this error message: "Error in model.frame.default(formula = US_born_West$adj_SALARY ~ Immigrant_West$adj_SALARY) :
variable lengths differ (found for 'Immigrant_West$adj_SALARY')"
Any ideas on how I can fix this? Thank you!
US_born_West$adj_SALARY and Immigrant_West$adj_SALARY are of unequal length, so the formula interface of t.test gives an error about that. We can pass them as individual vectors instead.
t.test(US_born_West$adj_SALARY, Immigrant_West$adj_SALARY,
       alternative = "greater", conf.level = 0.95)
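A self-contained run with the sample numbers from the question (t.test defaults to Welch's two-sample test, which does not require equal group sizes):

```r
# Rebuild the two data frames with the example salaries
US_born_West   <- data.frame(adj_SALARY = c(30000, 25000, 22000))
Immigrant_West <- data.frame(adj_SALARY = c(14000, 20000, 12000, 16000, 15000))

# Two separate vectors, not a formula, because the groups have different lengths
res <- t.test(US_born_West$adj_SALARY, Immigrant_West$adj_SALARY,
              alternative = "greater", conf.level = 0.95)
res$p.value < 0.05   # TRUE for these numbers: US-born salaries are higher
```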

SVM Predict Levels not matching between test and training data

I'm trying to predict a binary classification problem dealing with recommending films.
I've got a training data set of 50 rows (movies) and 6 columns (5 movie attributes and a consensus on the film).
I then have a test data set of 20 films with the same columns.
I then run
pred<-predict(svm_model, test)
and receive
Error in predict.svm(svm_model, test) : test data does not match model !.
From similar posts, it seems that the error is because the levels don't match between the training and test datasets. This is true and I've proved it by comparing str(test) and str(train). However, both datasets come from randomly selected films and will always have different levels for their categorical attributes. Doing
levels(test$Attr1) <- levels(train$Attr1)
changes the actual column data in test, thus rendering the predictor incorrect. Does anyone know how to solve this issue?
The first half dozen rows of my training set are in the following link.
https://justpaste.it/1ifsx
You could do something like this, assuming Attr1 is a character:
1. Create a vector of the unique values of Attr1 from both test and train.
2. Re-create Attr1 as a factor on both train and test, using all the levels found in step 1.
levels <- unique(c(train$Attr1, test$Attr1))
test$Attr1 <- factor(test$Attr1, levels=levels)
train$Attr1 <- factor(train$Attr1, levels=levels)
If you do not want factors, wrap the call in as.integer() and you will get numbers instead of factors. That is sometimes handier in models like xgboost and saves on one-hot encoding.
as.integer(factor(test$Attr1, levels=levels))
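A toy run of the same idea (the genre values here are made-up stand-ins for the real Attr1 data):

```r
# Character attribute with different observed values in each split
train <- data.frame(Attr1 = c("Action", "Drama", "Comedy"), stringsAsFactors = FALSE)
test  <- data.frame(Attr1 = c("Drama", "Horror"),           stringsAsFactors = FALSE)

# One shared level set covering both splits
all_levels  <- unique(c(train$Attr1, test$Attr1))
train$Attr1 <- factor(train$Attr1, levels = all_levels)
test$Attr1  <- factor(test$Attr1,  levels = all_levels)

identical(levels(train$Attr1), levels(test$Attr1))  # TRUE, so predict() can match them
as.integer(test$Attr1)                              # 2 4: one shared integer coding
```

The key point is that the underlying data values are unchanged; only the level sets are unified, so the fitted model and the test set agree on the encoding.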

Easy or default way to exclude rows with NA values from individual operations and comparisons

I work with survey data, where missing values are the rule rather than the exception. My datasets always have lots of NAs, and for simple statistics I usually want to work with cases that are complete on the subset of variables required for that specific operation, and ignore the other cases.
Most of R's base functions return NA if there are any NAs in the input. Additionally, subsets using comparison operators will return a row of NAs for any row with an NA on one of the variables. I literally never want either of these behaviors.
I would like for R to default to excluding rows with NAs for the variables it's operating on, and returning results for the remaining rows (see example below).
Here are the workarounds I currently know about:
Specify na.rm=T: Not too bad, but not all functions support it.
Add !is.na() to all comparison operations: Works, but it's annoying and error-prone to do this by hand, especially when there are multiple variables involved.
Use complete.cases(): Not helpful because I don't want to exclude cases that are missing any variable, just the variables being used in the current operation.
Create a new data frame with the desired cases: Often each row is missing a few scattered variables. That means that every time I wanted to switch from working with one variable to another, I'd have to explicitly create a new subset.
Use imputation: Not always appropriate, especially when computing descriptives or just examining the data.
I know how to get the desired results for any given case, but dealing with NAs explicitly for every piece of code I write takes up a lot of time. Hopefully there's some simple solution that I'm missing. But complex or partial solutions would also be welcome.
Example:
> z<-data.frame(x=c(413,612,96,8,NA), y=c(314,69,400,NA,8888))
# current behavior:
> z[z$x < z$y ,]
x y
3 96 400
NA NA NA
NA.1 NA NA
# Desired behavior:
> z[z$x < z$y ,]
x y
3 96 400
# What I currently have to do in order to get the desired output:
> z[(z$x < z$y) & !is.na(z$x) & !is.na(z$y) ,]
x y
3 96 400
One trick for dealing with NAs in inequalities when subsetting is to do
z[which(z$x < z$y),]
# x y
# 3 96 400
The which() silently drops NA values.
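For comparison, subset() behaves the same way on this example, since it treats an NA in the condition as "not selected":

```r
z <- data.frame(x = c(413, 612, 96, 8, NA), y = c(314, 69, 400, NA, 8888))

# which() converts the logical mask to integer positions, silently dropping NAs
z[which(z$x < z$y), ]

# subset() also leaves out rows where the condition evaluates to NA
subset(z, x < y)
```

Both return only row 3 (x = 96, y = 400), with no NA rows.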

Select multiple observations in a matrix based on a specific condition

I am very new to the R interface but need to use the program in order to run the relevant analyses for my clinical doctorate thesis. So, apologies in advance if this is a novice question.
I have a matrix of beta methylation values with the following dimensions: 485577 x 894. The row names of the matrix refer to CpG sites which are in non-numerical, non-ascending order (e.g. "cg00000029" "cg00000108" "cg00000109" "cg00000165"), while the column names refer to participant IDs which are also in non-numerical, non-ascending order (e.g. "11209" "14140" "1260" "5414").
I would like to identify which beta methylation values are > 0.5 so that I can exclude them from further analyses. In doing so, I need the data to stay in a matrix format. All attempts I have made to conduct this analysis have resulted in retrieval of integer variables rather than the data in a matrix format.
I would be so grateful if someone could please advise me of the code to conduct this analysis.
Thank you for your time.
Cheers,
Alicia
set.seed(1)                                   # so the example is reproducible
m <- matrix(runif(1000, 0, 0.6), nrow = 100)  # 100 rows x 10 cols, values in U[0, 0.6]
m[m > 0.5] <- NA                              # anything > 0.5 set to NA
z <- na.omit(m)                               # remove all rows with any NAs
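If you only want to locate the offending values, or drop the affected rows in one step while keeping the matrix class, a small sketch (toy dimensions and made-up row/column names, not the real 485577 x 894 data):

```r
set.seed(1)
m <- matrix(runif(50, 0, 0.6), nrow = 10,
            dimnames = list(paste0("cg", 1:10), paste0("id", 1:5)))

# (row, column) positions of every value > 0.5; m itself is untouched
idx <- which(m > 0.5, arr.ind = TRUE)

# Keep only rows with no value above 0.5; drop = FALSE preserves the matrix class
keep <- m[rowSums(m > 0.5) == 0, , drop = FALSE]

is.matrix(keep)    # TRUE
all(keep <= 0.5)   # TRUE
```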

Error when using mshapiro.test: "U[] is not a matrix with number of columns (sample size) between 3 and 5000"

I am trying to perform a multivariate test for normality on some density data from five sites, using mshapiro.test from the mvnormtest package. Each site is a column, and densities are below. It is 5 columns and 5 rows, with the top row as the header (site names). Here is how I loaded my data:
datafilename <- "/Users/megsiesiple/Documents/Lisa/lisadensities.csv"
data.nc5 <- read.csv(datafilename, header = TRUE)
attach(data.nc5)
The data look like this:
B07 B08 B09 B10 M
1 72571.43 17714.29 3142.86 22571.43 8000.00
2 44571.43 46857.14 49142.86 16857.14 7142.86
3 54571.43 44000.00 26571.43 6571.43 17714.29
4 57714.29 38857.14 32571.43 2000.00 5428.57
When I call mshapiro.test() for data.nc5 I get this message: Error in mshapiro.test(data.nc5) :
U[] is not a matrix with number of columns (sample size) between 3 and 5000
I know that to perform a Shapiro-Wilk test using mshapiro.test(), the data has to be in a numeric matrix, with a number of columns between 3 and 5000. However, even when I make the .csv a matrix with only numbers (i.e., when I omit the Site names), I still get the error. Do I need to set up the matrix differently? Has anyone else had this problem?
Thanks!
You need to transpose the data into a matrix, so that your variables are in rows and observations in columns. The command will be:
M <- t(data.nc5[1:4,1:5])
mshapiro.test(M)
It works for me this way. The labels in the first row should be recognized during the import, so the data will start from row 1. Otherwise, there will be a "missing value" error.
If you read the numeric data into R via read.csv() using code similar to what you show, it will be read in as a data frame, and that is not a matrix.
Try
mat <- data.matrix(data.nc5)
mshapiro.test(mat)
(Not tested as you don't give a reproducible example and it is late-ish in my time zone now ;-)
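A minimal reconstruction with the four rows posted (mvnormtest itself isn't loaded here; this just shows the shape that mshapiro.test expects):

```r
# As read.csv() would return it: a data frame, not a matrix
data.nc5 <- data.frame(
  B07 = c(72571.43, 44571.43, 54571.43, 57714.29),
  B08 = c(17714.29, 46857.14, 44000.00, 38857.14),
  B09 = c( 3142.86, 49142.86, 26571.43, 32571.43),
  B10 = c(22571.43, 16857.14,  6571.43,  2000.00),
  M   = c( 8000.00,  7142.86, 17714.29,  5428.57)
)
is.matrix(data.nc5)            # FALSE

# data.matrix() coerces to numeric matrix; t() puts variables in rows
U <- t(data.matrix(data.nc5))
dim(U)                         # 5 4: five variables, four observations
# mshapiro.test(U) now receives a numeric matrix whose column count (4)
# falls in the required 3..5000 range (requires the mvnormtest package)
```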
