t-test using a column variable from 2 different data frames in R

I am attempting to conduct a t-test in R to determine whether there is a statistically significant difference in salary between US-born and foreign-born workers in the Western US. I have 2 different data frames for the two groups based on nativity, and want to compare the salary column, titled "adj_SALARY". For simplicity, say that there are 3 observations in the US_born_West data frame and 5 in the Immigrant_West data frame.
US_born_West$adj_SALARY <- c(30000, 25000, 22000)
Immigrant_West$adj_SALARY <- c(14000, 20000, 12000, 16000, 15000)
#Here is what I attempted to run:
t.test(US_born_West$adj_SALARY~Immigrant_West$adj_SALARY, alternative="greater",conf.level = .95)
However I received this error message: "Error in model.frame.default(formula = US_born_West$adj_SALARY ~ Immigrant_West$adj_SALARY) :
variable lengths differ (found for 'Immigrant_West$adj_SALARY')"
Any ideas on how I can fix this? Thank you!

US_born_West$adj_SALARY and Immigrant_West$adj_SALARY are of unequal length, and the formula interface of t.test expects a single outcome variable plus a grouping variable, not two outcome vectors, so it gives an error about that. We can pass them as individual vectors instead.
t.test(US_born_West$adj_SALARY, Immigrant_West$adj_SALARY,
       alternative = "greater", conf.level = 0.95)
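For example, with the sample values from the question (written here as plain vectors standing in for the data frame columns):

us_salary  <- c(30000, 25000, 22000)
imm_salary <- c(14000, 20000, 12000, 16000, 15000)
t.test(us_salary, imm_salary, alternative = "greater", conf.level = 0.95)

This runs a Welch two-sample t-test of the one-sided hypothesis that the US-born mean is greater.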

Related

TukeyHSD returns too many values in R

I'm very new to R (and statistics) and I searched a lot for a possible solution, but couldn't find any.
I have a data set with around 18000 entries, which contains two columns: "rentals" and "season". I want to analyse whether the mean of the rentals differs depending on the season, using a one-way ANOVA.
My data looks like this:
rentals  season
23       1
12       1
17       2
16       2
44       3
22       3
2        4
14       4
First I calculate the SD and MEAN of the groups (season):
anova %>%
  group_by(season) %>%
  summarise(
    count_season = n(),
    mean_rentals = mean(rentals, na.rm = TRUE),
    sd_rentals = sd(rentals, na.rm = TRUE))
This is the result:
Then I perform the one-way ANOVA:
anova_one_way <- aov(season~as.factor(rentals), data = anova)
summary(anova_one_way)
(I use as.factor on rentals, because otherwise I'm getting an error with TukeyHSD.)
Result:
Here comes the tricky part. I perform a TukeyHSD test:
TukeyHSD(anova_one_way)
And the results are very disappointing. TukeyHSD returns 376896 rows, while I expect it to return just a few, comparing the seasons with each other. It looks like every single "rentals" row is being handled as a single group. This seems to be very wrong but I can't find the cause. Is this a common TukeyHSD behaviour considering the big data set or is there an error in my code or logic, which causes this enormous unreadable list of values as a return?
Here is a small excerpt of what it looks like (and it goes on until 376896).
The terms are the wrong way around in your aov() call. Rentals is the outcome (dependent) variable, season is the predictor (independent) variable.
So you want:
anova_one_way <- aov(rentals ~ factor(season), data = anova)
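With the formula the right way around, TukeyHSD compares only the four season levels, for example:

summary(anova_one_way)
# Pairwise comparisons of the 4 seasons: 6 rows (2-1, 3-1, 4-1, 3-2, 4-2, 4-3)
TukeyHSD(anova_one_way)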

Get rid of 0s before running chi-square test

My variable RaceCat looks like this after running table(data$RaceCat).
I want to run a chi-square test, but I know I need to get rid of the races American Indian and Middle Eastern since those have 0s. American Indian is coded as 5 and Middle Eastern as 4. My thought process is to do data %>% filter(RaceCat != 5) %>% filter(RaceCat != 4).
However, when I try to do that, R says length of 'dimnames' [2] not equal to array extent.
Here is another thing I tried to do:
hivneg <- droplevels(hivneg)
a <- table(hivneg$RaceCat, hivneg$PrEP2)
chisq.test(a, correct = FALSE)
However, the table() and chisq.test() won't run. It says
Warning message in mapply(FUN = f, ..., SIMPLIFY = FALSE):
“longer argument not a multiple of length of shorter”
From the image shown, it seems that the column 'RaceCat' is of factor class with some unused levels. An option is to update the data with droplevels, which removes the unused levels from any of the factor columns:
data <- droplevels(data)
NOTE: table just returns the frequency count of each unique value, or, if the column is a factor, the count of each level of the factor.
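A minimal sketch of the full flow under that assumption, reusing the column names RaceCat and PrEP2 from the question:

data <- droplevels(data)               # drop unused factor levels (e.g. the empty races coded 4 and 5)
a <- table(data$RaceCat, data$PrEP2)   # contingency table without the all-zero rows
chisq.test(a, correct = FALSE)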

SVM Predict Levels not matching between test and training data

I'm trying to predict a binary classification problem dealing with recommending films.
I've got a training data set of 50 rows (movies) and 6 columns (5 movie attributes and a consensus on the film).
I then have a test data set of 20 films with the same columns.
I then run
pred<-predict(svm_model, test)
and receive
Error in predict.svm(svm_model, test) : test data does not match model !.
From similar posts, it seems that the error is because the levels don't match between the training and test datasets. This is true and I've proved it by comparing str(test) and str(train). However, both datasets come from randomly selected films and will always have different levels for their categorical attributes. Doing
levels(test$Attr1) <- levels(train$Attr1)
changes the actual column data in test, thus rendering the predictor incorrect. Does anyone know how to solve this issue?
The first half dozen rows of my training set are in the following link.
https://justpaste.it/1ifsx
You could do something like this, assuming Attr1 is a character:
Create a levels vector with the unique values of Attr1 from both test and train.
Turn Attr1 in both train and test into a factor with all the levels found in the previous step.
levels <- unique(c(train$Attr1, test$Attr1))
test$Attr1 <- factor(test$Attr1, levels=levels)
train$Attr1 <- factor(train$Attr1, levels=levels)
If you do not want factors, add as.integer to part of the code and you will get numbers instead of factors. That is sometimes handier in models like xgboost and saves on one-hot encoding.
as.integer(factor(test$Attr1, levels=levels))
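Applied to every attribute column and followed by a refit, this might look like the sketch below. It assumes the model comes from e1071::svm, that the five attribute columns are all categorical and named Attr1 through Attr5, and that the consensus column is called consensus; those names are placeholders for whatever your data actually uses.

library(e1071)

# Harmonise the factor levels of each attribute column across train and test
for (col in paste0("Attr", 1:5)) {
  all_levels <- unique(c(as.character(train[[col]]), as.character(test[[col]])))
  train[[col]] <- factor(train[[col]], levels = all_levels)
  test[[col]]  <- factor(test[[col]],  levels = all_levels)
}

train$consensus <- factor(train$consensus)  # classification needs a factor outcome
svm_model <- svm(consensus ~ ., data = train)
pred <- predict(svm_model, test)            # levels now match, so predict no longer complains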

How to handle errors in predict function of R?

I have a dataframe df, and I am building a machine learning model (C5.0 decision tree) to predict the class of a column (loan_approved):
Structure (not real data):
id occupation income loan_approved
1 business 4214214 yes
2 business 32134 yes
3 business 43255 no
4 sailor 5642 yes
5 teacher 53335 no
6 teacher 6342 no
Process:
I randomly split the data frame into test and train sets and trained the model on the train set (rows 1, 2, 3, 5, 6 as train and row 4 as test).
In order to account for new categorical levels in one or more columns, I used tryCatch.
Function:
error_free_predict <- function(x) {
  output <- tryCatch({
    predict(C50_model, newdata = test[x, ], type = "class")
  }, error = function(e) {
    "no"
  })
  return(output)
}
Applied the predict function:
test <- mutate(test, predicted_class = error_free_predict(1:NROW(test)))
Problem:
id occupation income loan_approved predicted_class
1 business 4214214 yes no
2 business 32134 yes no
3 business 43255 no no
4 sailor 5642 yes no
5 teacher 53335 no no
6 teacher 6342 no no
Question:
I know this is because the test data frame had a new level that was not present in the train data, but shouldn't my function work for all cases except this one?
P.S.: I did not use sapply because it was too slow.
There are two parts to this problem.
The first part of the problem arises while training the model, because categorical levels are not guaranteed to be split evenly between train and test when you split randomly. In your case, say you have only one record with occupation "sailor"; it is then possible that it ends up in the test set after a random split. A model built on the train data would never have seen the occupation "sailor" and hence will throw an error when predicting on it, and because the whole predict call is wrapped in a single tryCatch, that one error makes every row fall back to "no". In the more general case, some level of any categorical variable can end up entirely in the test set after random splitting.
So instead of splitting the data randomly between train and test you can do stratified sampling. Code using data.table for a 70:30 split:
library(data.table)
ind <- total_data[, sample(.I, round(0.3 * .N), FALSE), by = "occupation"]$V1
train <- total_data[-ind, ]
test <- total_data[ind, ]
This makes sure every level is split proportionally between the train and test datasets, so you will not get a "new" categorical level in the test dataset, which could happen with purely random splitting.
The second part of the problem comes when the model is in production and it encounters an altogether new level that was not in either the training or the test set. To tackle this, you can maintain a list of all levels of all categorical variables, e.g. lvl_cat_var1 <- unique(cat_var1), lvl_cat_var2 <- unique(cat_var2), etc. Then, before calling predict, check for new levels and filter:
new_lvl_data <- total_data[!(var1 %in% lvl_cat_var1 & var2 %in% lvl_cat_var2)]
pred_data <- total_data[(var1 %in% lvl_cat_var1 & var2 %in% lvl_cat_var2)]
then for the default prediction do:
new_lvl_data$predicted_class <- "no"
and run the full prediction only on pred_data.
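Putting the pieces together, a hypothetical sketch of the filter-then-predict step, reusing the names var1/var2 and C50_model from above (data.table syntax):

# Score only the rows whose levels were seen at training time,
# and fall back to the default "no" for the rest
pred_data$predicted_class <- predict(C50_model, newdata = pred_data, type = "class")
new_lvl_data$predicted_class <- "no"
result <- rbind(pred_data, new_lvl_data)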
I generally do this using a loop in which any level not seen in train is recoded as NA. Here train is the data you used for training the model and test is the data to be used for prediction.
for (i in 1:ncol(train)) {
  if (is.factor(train[, i])) {
    test[, i] <- factor(test[, i], levels = levels(train[, i]))
  }
}
tryCatch is an error handling mechanism, i.e. it only kicks in after an error has been encountered, so it is not applicable unless you want to do something different at that point. If you still want to run the model, this loop takes care of the new levels by turning them into NA.

R unexpected NA output from RandomForest

I'm working with a data set that has a lot of NA's. I know that the first 6 columns do NOT have any NA's. Since the first column is an ID column I'm omitting it.
I run the following code to select only lines that have values in the response column:
sub1 <- TrainingData[which(!is.na(TrainingData[,70])),]
I then use sub1 as the data set in a randomForest using this code:
set.seed(448)
RF <- randomForest(sub1[, 2:6], sub1[, 70],
                   do.trace = TRUE, importance = TRUE, ntree = 10, forest = TRUE)
then I run this code to check the output for NA's:
> length(which(is.na(RF$predicted)))
[1] 65
I can't figure out why I'd be getting NA's if the data going in is clean.
Any suggestions?
I think you should use more trees. The predicted values are the predictions for the out-of-bag set, and if the number of trees is very small, some cases never appear in any out-of-bag sample, because that set is formed randomly for each tree; those cases get NA.
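A minimal sketch of the fix, reusing the objects from the question (ntree = 500 is the package default):

library(randomForest)
set.seed(448)
RF <- randomForest(sub1[, 2:6], sub1[, 70],
                   importance = TRUE, ntree = 500)
sum(is.na(RF$predicted))  # with enough trees, every row should get an out-of-bag prediction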
