Handling skip in rpart and random forest - r

I have a dataset containing 10 categorical variables. Each of these has missing values coded as (-9, -6, -3, -2, -1). I want to create one column that takes the mean of these 10 variables, excluding the negative values. I could collapse the negative values into NA and then median-impute them, but I need to retain -6, since -6 means the person skipped the question because it does not apply to them. For instance, parental relationship quality does not apply to single parents. I ultimately want to use this variable as a predictor in my random forest model, so I am not sure how to handle -6 in this case. One approach I could think of is to impute each of the 10 variables as follows (let's say the 10 variables are a1 to a10):
missing_categs <- c(-9, -3, -2, -1)
df$a1[df$a1 %in% missing_categs] <- median(df$a1[!(df$a1 %in% c(missing_categs, -6))])  # median of the valid answers
After the above step, I calculate the average of a1 to a10. The rows that yield "-6" are the ones that pertain to single parents (i.e. the question does not apply to them). Then I convert -6 to NA, so I end up with average values plus NAs. Can rpart and random forest models handle NA? Better alternative solutions are most welcome. Thanks in advance!
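For reference, a minimal sketch of that workflow, under the assumption that the data frame is called df and the columns really are named a1 to a10 (both names are placeholders here):
missing_categs <- c(-9, -3, -2, -1)
vars <- paste0("a", 1:10)
for (v in vars) {
  valid <- df[[v]][!(df[[v]] %in% c(missing_categs, -6))]   # observed, applicable answers
  df[[v]][df[[v]] %in% missing_categs] <- median(valid)     # median-impute -9/-3/-2/-1 only
}
df$a_mean <- rowMeans(df[vars])                             # average of a1 to a10
df$a_mean[df$a_mean == -6] <- NA                            # all-(-6) rows: question did not apply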

Can rpart and random forest models handle NA?
I do not know what you mean by handle. If you mean that you can use NA in the predictors, then the answer is yes for rpart:
> library(rpart)
> df <- data.frame(c(1, 2, NA), c(4, 5, 6))
> rpart(df, na.action=na.pass)
n= 3
node), split, n, deviance, yval
* denotes terminal node
but not for randomForest:
> library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
> df <- data.frame(c(1, 2, NA), c(4, 5, 6))
> randomForest(df, na.action=na.pass)
Error in randomForest.default(df, na.action = na.pass) :
NA not permitted in predictors
If by handle you mean that they are able to deal with NAs in some manner, for example via a function you supply, then the answer is yes for both.
rpart and randomForest both have an na.action parameter which you can use; see the documentation of rpart and of randomForest.
The default na.action for rpart is na.rpart, which deletes "all observations for which y is missing", while "those in which one or more predictors are missing" are kept.
The default na.action for randomForest is na.fail, which returns the given data structure unaltered if no NAs are found, and "signals an error" if at least one NA is found.
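For completeness, a minimal sketch of both routes on a toy data frame (the data frame and column names below are made up for illustration): rpart can be given na.pass, while the randomForest package ships na.roughfix, which median/mode-imputes the predictors before fitting.
library(rpart)
library(randomForest)
# toy data with one NA in a predictor (illustrative only)
df <- data.frame(x1 = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10),
                 x2 = c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13),
                 y  = c(1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9, 9.1, 10.2))
rpart(y ~ ., data = df, na.action = na.pass)             # keeps the row with the missing predictor
randomForest(y ~ ., data = df, na.action = na.roughfix)  # imputes the NA (median/mode) before fitting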

Related

R: dpois not returning correct probabilities?

I am working with a dataset of the number of truffles found in 288 search areas. I am planning to test the null hypothesis that the truffles are distributed randomly, so I am using dpois() to calculate the expected probabilities. There are 4 categories (0, 1, 2, or 3 truffles per plot). The expected probabilities will later be converted to expected proportions and incorporated into a chisq.test analysis.
The problem is that the expected probabilities that I get with the following code don't make sense. They should sum to 1, but are much too small. I run the same exact code with another dataset and it produces normal values. What is going on here?
trufflesFound<-c(rep(0,203),rep(1,39),rep(2,18),rep(3,28))
trufflesTable<-table(trufflesFound)
trufflesTable
mean(trufflesTable)
expTruffPois<-dpois(x = 0:3, lambda = mean(trufflesTable))
expTruffPois
These are the probabilities it gives me, which are much too low!
0: 0.00000000000000000000000000000005380186
1: 0.00000000000000000000000000000387373404
2: 0.00000000000000000000000000013945442527
3: 0.00000000000000000000000000334690620643
In contrast, this dataset works just fine:
extinctData<-c(rep(1,13),rep(2,15),rep(3,16),rep(4,7),rep(5,10),rep(6,4),7,7,8,9,9,10,11,14,16,16,20)
extinctFreqTable <- table(extinctData)
extinctFreqTable
mean(extinctFreqTable)
expPois <- dpois(x = 0:20, lambda = mean(extinctFreqTable))
expPois
sum(expPois)
The sum is 0.9999997, which is close to the expected value of 1
Thoughts?
Lambda should be the mean number of truffles per search area, but mean(trufflesTable) returns the mean of the frequency counts (203, 39, 18, 28), i.e. 72. Use mean(trufflesFound) instead. The second data set only looks "right" because mean(extinctData) happens to be relatively close to mean(extinctFreqTable).
Note that the probabilities don't sum exactly to 1, because given this mean it is conceivable that a future search area would contain 4 or more truffles.
trufflesFound<-c(rep(0,203),rep(1,39),rep(2,18),rep(3,28))
expTruffPois<-dpois(x = 0:3, lambda = mean(trufflesFound))
expTruffPois
#> [1] 0.57574908 0.31786147 0.08774301 0.01614715
sum(expTruffPois)
#> [1] 0.9975007
Created on 2022-02-08 by the reprex package (v2.0.1)
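If it helps, here is one hedged way the corrected probabilities could feed into chisq.test, assuming the last category is meant as "3 or more truffles" so that the expected proportions sum to 1:
obs <- as.vector(table(trufflesFound))           # observed counts: 203, 39, 18, 28
p   <- dpois(0:2, lambda = mean(trufflesFound))  # P(0), P(1), P(2)
p   <- c(p, 1 - sum(p))                          # catch-all P(3 or more)
chisq.test(obs, p = p)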

R calculate the correlation coefficient

I have a data frame with 3 variables: "age", "confidence" and "countryname". I want to compare the correlation between age and confidence across different countries, so I wrote the following command to calculate the correlation coefficient per country.
correlate <- evs %>% group_by(countryname) %>% summarise(c = cor(age, confidence))
But I found that there are a lot of missing values in the output "c". I'm wondering whether that means there is little correlation between the IV and DV for those countries, or whether there is something wrong with my commands?
An NA in the correlation matrix means that you have NA values (i.e. missing values) in your observations. The default behaviour of cor is to return a correlation of NA "whenever one of its contributing observations is NA" (from the manual).
That means that a single NA in the data will give a correlation of NA, even when you have only one NA among a thousand usable observations.
What you can do from here:
You should investigate these NAs, count them, and determine whether your data set contains enough usable data. Find out which variables are affected by NAs and to what extent.
Add the argument use when calling cor. This way you specify how the algorithm shall handle missing values. Check out the manual (with ?cor) to find out what options you have. In your case I would just use use="complete.obs". With only 2 variables, most (but not all) options will yield the same result.
Some more explanation:
age <- 18:35
confidence <- (age - 17) / 10 + rnorm(length(age))
cor(age, confidence)
#> [1] 0.3589942
Above is the correlation with all the data. Now let's set a few NAs and try again:
confidence[c(1, 6, 11, 16)] <- NA
cor(age, confidence) # the use argument implicitly defaults to "everything"
#> [1] NA
This gives NA because some confidence values are NA.
The next statement still gives a result:
cor(age, confidence, use="complete.obs")
#> [1] 0.3130549
Created on 2021-10-16 by the reprex package (v2.0.1)
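Applied back to the original grouped calculation, that would look something like this (assuming evs is the data frame from the question, with columns age, confidence and countryname):
library(dplyr)
correlate <- evs %>%
  group_by(countryname) %>%
  summarise(c = cor(age, confidence, use = "complete.obs"))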
I know two ways of calculating it in R:
via built-in cor() function,
manual calculation with code
Calculation with the built-in cor() function:
# importing df:
state_crime <- read.csv("~/Documents/R/state_crime.csv")
# checking colnames:
colnames(state_crime)
[1] "state" "year" "population"
[4] "murder_rate"
# correlation coefficient between population and murder rate:
cor(state_crime$population, state_crime$murder_rate,
method = "pearson")
[1] -0.0322388
Manual calculation with code:
# creating columns for "deviation from the mean" for both variables:
state_crime <- state_crime %>%
  mutate(dev_mean_murderrate = murder_rate - mean(murder_rate)) %>%
  mutate(dev_mean_population = population - mean(population)) %>%
  data.frame()
# implementing the formula: r = sum((x - mx) * (y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
sum(state_crime$dev_mean_population * state_crime$dev_mean_murderrate) /
  sqrt(sum((state_crime$murder_rate - mean(state_crime$murder_rate))^2) *
       sum((state_crime$population - mean(state_crime$population))^2))
[1] -0.0322388

Exclude missing values from model performance calculation

I have a dataset and I want to build a model, preferably with the caret package. My data is actually a time series but the question is not specific to time series, it's just that I work with CreateTimeSlices for the data partition.
My data has a certain number of missing values (NA), and I imputed them separately from the caret code. I also kept a record of their locations:
# a logical vector the same size as the data, marking which observations were imputed
imputed=c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
imputed[imputed] <- NA; print(imputed)
#### [1] FALSE FALSE FALSE NA FALSE FALSE
I know there is an option in Caret train function to either exclude the NA or impute them with different techniques. That's not what I want. I need to build the model on the already imputed dataset but I want to exclude the imputed points from the calculation of the error indicators (RMSE, MAE, ...).
I don't know how to do this in caret. In my first script I tried to do the whole cross validation manually, and then I had a customized error measure:
actual = c(5, 4, 3, 6, 7, 5)
predicted = c(4, 4, 3.5, 7, 6.8, 4)
Metrics::rmse(actual, predicted) # with all the points
#### [1] 0.7404953
sqrt(mean( (!imputed)*(actual-predicted)^2 , na.rm=T)) # excluding the imputed
#### [1] 0.676757
How can I do this in caret? Or is there another way to avoid coding everything by hand?
I don't know if this is exactly what you are looking for, but here is a simple solution based on a small helper function:
i <- which(imputed == FALSE)  # indices of the observations that were NOT imputed
metric_na <- function(fun, actual, predicted, index) {
  fun(actual[index], predicted[index])
}
metric_na(Metrics::rmse, actual, predicted, index = i)
0.676757
metric_na(Metrics::mae, actual, predicted, index = i)
0.54
Also you can just use the index directly while calculating the desired metrics:
Metrics::rmse(actual[i], predicted[i])
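The same idea extends to computing several metrics at once, for example (assuming the Metrics package is available):
sapply(list(rmse = Metrics::rmse, mae = Metrics::mae),
       function(f) f(actual[i], predicted[i]))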

Summary statistics for imputed data from Zelig & Amelia

I'm using Amelia to impute the missing values.
While I'm able to use Zelig and Amelia to do some calculations...
How do I use these packages to find the pooled means and standard deviations of the newly imputed data?
library(Amelia)
library(Zelig)
n= 100
x1= rnorm(n,0,1) #random normal distribution
x2= .4*x1+rnorm(n,0,sqrt(1-.4^2)) #x2 is correlated with x1, r=.4
x1= ifelse(rbinom(n,1,.2)==1,NA,x1) #randomly creating missing values
d= data.frame(cbind(x1,x2))
m=5 #set 5 imputed data frames
d.imp=amelia(d,m=m) #imputed data
summary(d.imp) #provides summary of imputation process
I couldn't figure out how to format the code in a comment so here it is.
foo <- function(x, fcn) apply(x, 2, fcn)
lapply(d.imp$imputations, foo, fcn = mean)
lapply(d.imp$imputations, foo, fcn = sd)
d.imp$imputations gives a list of all the imputed data sets. You can work with that list however you are comfortable with to get out the means and sds by column and then pool as you see fit. Same with correlations.
lapply(d.imp$imputations, cor)
Edit: After some discussion in the comments I see that what you are looking for is how to combine results across the data sets imputed by Amelia using Rubin's rules, for example for the mean. I think you should clarify in the title and body of your post that you want to combine results over imputations to get appropriate standard errors with Rubin's rules after imputing with the Amelia package; this was not clear from the title or original description. "Pooling" can mean different things, particularly w.r.t. variances.
The mi.meld function expects a q matrix of estimates from each imputation, an se matrix of the corresponding standard error estimates, and a logical byrow argument. See ?mi.meld for an example. In your case, you want the sample means and their estimated standard errors for each of your imputed data sets in the q and se matrices passed to mi.meld, respectively.
q <- t(sapply(d.imp$imputations, foo, fcn = mean))            # one row of column means per imputation
se <- t(sapply(d.imp$imputations, foo, fcn = sd)) / sqrt(100) # se of the mean: sd / sqrt(n), with n = 100
output <- mi.meld(q = q, se = se, byrow = TRUE)
should get you what you're looking for. For other statistics than the mean, you will need to get an SE either analytically, if available, or by, say, bootstrapping, if not.
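The pooled results should then be in the returned list; the component names q.mi and se.mi used below are taken from my reading of ?mi.meld, so double-check them on your installed version:
output$q.mi   # pooled column means (x1, x2)
output$se.mi  # corresponding Rubin's-rules standard errors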

T-test with grouping variable

I've got a data frame with 36 variables and 74 observations. I'd like to run a paired two-sample t-test of 35 variables by 1 grouping variable (with two levels).
For example: the data frame contains "age", "body weight" and "group" variables.
Now I suppose I can do the t-test for each variable with this code:
t.test(age~group)
But, is there a way to test all the 35 variables with one code, and not one by one?
Sven has provided you with a great way of implementing what you wanted to have implemented. I, however, want to warn you about the statistical aspect of what you are doing.
Recall that if you are using the standard significance level of 0.05, each t-test performed carries a 5% chance of committing a Type 1 error (incorrectly rejecting the null hypothesis). By the laws of probability, running 35 independent t-tests compounds the probability of committing at least one Type 1 error; more precisely:
Pr(at least one Type 1 error) = 1 - (0.95)^35 ≈ 0.834
This means you have about an 83.4% chance of falsely rejecting at least one null hypothesis. Basically, by running so many t-tests, there is a very high probability that at least one of them will give you a false positive.
Just FYI.
An example data frame:
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
weight = rnorm(10, 30), group = gl(2,5))
You can use lapply:
lapply(dat[1:3], function(x)
t.test(x ~ dat$group, paired = TRUE, na.action = na.pass))
In the command above, 1:3 refers to the indices of the columns containing the variables. The argument paired = TRUE is necessary to perform a paired t-test.
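Building on the multiple-testing caveat in the first answer, one hedged follow-up is to collect the p-values from the lapply call and adjust them, e.g. with Holm's method via p.adjust:
tests <- lapply(dat[1:3], function(x)
  t.test(x ~ dat$group, paired = TRUE, na.action = na.pass))
pvals <- sapply(tests, function(tt) tt$p.value)
p.adjust(pvals, method = "holm")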
