Grouping factors in a pooled two-sample t-test in R

I have a two-column table of weights for 7 men and 11 women (saved as weights_gender.csv) and aim to perform a pooled t-test. I have read the CSV in with weight = read.csv("weights_gender.csv"), but whenever I try to run t.test(weight$men ~ weight$women, var.equal = TRUE), it keeps printing this message:
grouping factor must have exactly 2 levels.
What is the issue?

Try ...
t.test(x = weight$men, y = weight$women, var.equal = TRUE)
The way you specified the command, R thought you wanted men's weight grouped by the women variable, which of course is not what you want. The formula interface y ~ g expects g to be a grouping factor with exactly two levels, and weight$women is numeric, hence the error.
Results...
Two Sample t-test
data: weight$men and weight$women
t = 5.9957, df = 16, p-value = 1.867e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
15.26250 31.95828
sample estimates:
mean of x mean of y
77.42857 53.81818
Data
weight <- data.frame(
  men   = c(88, 90, 78, 75, 70, 72, 69, NA, NA, NA, NA),  # NAs pad the shorter column to 11 rows
  women = c(45, 57, 54, 62, 60, 59, 44, 43, 67, 50, 51)
)
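If you prefer the formula interface, here is a minimal sketch using the data above: reshape to long format so that the grouping variable has exactly two levels.
# stack() reshapes wide -> long, giving columns `values` and `ind`;
# na.omit() drops the NA padding in the men column
long <- na.omit(stack(weight))
t.test(values ~ ind, data = long, var.equal = TRUE)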

Your question is a bit "theoretical", so I'll make it more concrete.
Here I make two data frames with men's and women's weights and label each by sex.
library(tibble)  # tibble() is not in base R
df_m <- tibble(weight = 170 + 30 * rnorm(7),  sex = "Male")
df_f <- tibble(weight = 130 + 30 * rnorm(11), sex = "Female")
Next we combine the data and convert sex to a factor variable.
df_all <- rbind(df_m, df_f)
df_all$sex <- as.factor(df_all$sex)
Finally we apply the t-test.
t.test(weight ~ sex, data = df_all, var.equal = TRUE)
My result was
Two Sample t-test
data: weight by sex
t = -5.2104, df = 16, p-value = 8.583e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-89.84278 -37.87810
sample estimates:
mean in group Female mean in group Male
120.2316 184.0921
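Note the sign of the result: with R's default alphabetical factor levels, Female is the first group, so the reported difference (Female minus Male) is negative. Use relevel() if you want Male to be the reference group.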

Related

Creating data table of points above/below abline in ggplot2

Is it possible to identify data points above a geom_abline in ggplot, and to create a new data table separating these data points using data.table?
I have a panel dataset with 150 unique IDs and have fit a fixed effects model using plm(). Here is a sample of the dataset:
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   progenyMean = c(90,78,92,69,86,73,82,85,91),
                   damMean = c(89,89,72,98,95,92,94,87,89))
ID, year, progenyMean, damMean
1, 1, 70, 69
1, 2, 68, 69
1, 3, 72, 72
1, 4, 69, 68
2, 1, 76, 75
2, 2, 73, 80
3, 1, 72, 74
3, 2, 75, 67
3, 3, 71, 69
# Fixed effects model in plm
library(plm)
fixed <- plm(progenyMean ~ damMean, data, model = "within", index = c("ID", "year"))
I have plotted the response progenyMean vs damMean using the following code:
plotFunction <- function(aggData, year){
  ggplot(aggData, aes(x = damMeanCentered, y = progenyMean3Y)) +
    geom_point() +
    geom_abline(slope = fixed$coefficients, intercept = 71.09, colour = 'dodgerblue1', size = 1)
  # The intercept 71.09 was calculated using the mean of fixef(fixed)
}
plotFunction(data, '(2005 - 2012)')
Is it possible to identify the points above/below the geom_abline in ggplot and create a new data table separating these data points using data.table?
It is not clear where the intercept came from, but nevertheless the trick is to add a prediction column to your dataset using the regression model (in your case fixed), then filter for the actual values that are higher than the prediction.
library(dplyr)
data %>%
  mutate(predict = predict(fixed, newdata = data)) %>%
  filter(progenyMean > predict)
First make the predictions (the := syntax requires a data.table):
library(data.table)
setDT(data)
data[, newpredict := predict(fixed, newdata = data)]
It's not clear what you want the new data.table to look like but you'd get the values above predictions by doing
data[progenyMean > newpredict]
For below, you'd obviously just change the > to <.
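If you want both directions in one pass, a small sketch (assuming data is the data.table with the newpredict column from above):
# Label every row as above or below the fitted line
data[, position := fifelse(progenyMean > newpredict, "above", "below")]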

Identifying data points above fixed effects regression using data.table

I want to identify the data points above a regression line. I have a panel data set to which I have fit a fixed effects model:
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   progenyMean = c(90,78,92,69,86,73,82,85,91),
                   damMean = c(89,89,72,98,95,92,94,87,89))
ID, year, progenyMean, damMean
1, 1, 70, 69
1, 2, 68, 69
1, 3, 72, 72
1, 4, 69, 68
2, 1, 76, 75
2, 2, 73, 80
3, 1, 72, 74
3, 2, 75, 67
3, 3, 71, 69
# Fixed Effects Model in plm
fixed <- plm(progenyMean ~ damMean, data, model= "within", index = c("ID","year"))
I have plotted progenyMean vs damMean with the fixed effects regression line. I want to identify the IDs above this regression line.
I have computed the predicted values of the fixed effects model using the following code (based on code from this post):
fitted <- as.numeric(fixed$model[[1]] - fixed$residuals)
> fitted
[1] 71.24338 79.03766 74.86613 71.34263 70.83020 71.56797 72.17324 74.54755 71.16720 73.37487
[11] 70.58863 69.27203 71.05852 59.72911 63.43947 68.69871 67.25271 75.68397 76.30475 81.12128
Is it possible to identify the IDs above the fixed effects regression line using the predicted values above and data.table in R?
Use the residuals() function. A positive residual means the point lies above the fitted line; a negative residual means it lies below.
library(plm)
library(tidyverse)  # loads ggplot2 and dplyr
data <- data.frame(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   progenyMean = c(90,78,92,69,86,73,82,85,91),
                   damMean = c(89,89,72,98,95,92,94,87,89))
fixed <- plm(progenyMean ~ damMean, data, model = "within", index = c("ID","year"))
residuals(fixed)
# Colour points by the sign of their residual
data %>% ggplot(aes(damMean, progenyMean)) +
  geom_point(data = data %>% filter(residuals(fixed) > 0), col = "red") +
  geom_point(data = data %>% filter(residuals(fixed) < 0), col = "blue")
# Count how many points fall above ("up") and below ("down") the line
data %>% mutate(
  test = ifelse(residuals(fixed) > 0, "up", "down") %>% factor()
) %>%
  group_by(test) %>%
  summarise(n = n())
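Since you asked about data.table specifically, here is a sketch of the same residual-sign idea in data.table (assuming the fixed model above; as.numeric() strips the pseries class from plm residuals, and the rows here are already in the ID/year order plm uses):
library(data.table)
dt <- as.data.table(data)
dt[, resid := as.numeric(residuals(fixed))]
dt[resid > 0, unique(ID)]  # IDs with at least one point above the line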

Am I using xgboost() correctly (in R)?

I'm a beginner with machine learning (and also R). I've figured out how to run some basic linear regression, elastic net, and random forest models in R and have gotten some decent results for a regression project (with a continuous dependent variable) that I'm working on.
I've been trying to learn how to use the gradient boosting algorithm and, in particular, the xgboost() command. My results are way worse here, though, and I'm not sure why.
I was hoping someone could take a look at my code and see if there are any glaring errors.
# Create training data with and without the dependent variable
library(dplyr)  # for select()
train <- data[1:split, ]
train.treat <- select(train, -c(y))
# Create test data with and without the dependent variable
test <- data[(split + 1):nrow(data), ]
test.treat <- select(test, -c(y))
# Load the package xgboost
library(xgboost)
# Run xgb.cv
cv <- xgb.cv(data = as.matrix(train.treat),
             label = train$y,
             nrounds = 100,
             nfold = 10,
             objective = "reg:linear",
             eta = 0.1,
             max_depth = 6,
             early_stopping_rounds = 10,
             verbose = 0  # silent
)
# Get the evaluation log
elog <- cv$evaluation_log
# Determine and print how many trees minimize training and test error
elog %>%
  summarize(ntrees.train = which.min(train_rmse_mean),  # index of min(train_rmse_mean)
            ntrees.test  = which.min(test_rmse_mean))   # index of min(test_rmse_mean)
# The number of trees to use, as determined by xgb.cv
ntrees <- 25
# Run xgboost
model_xgb <- xgboost(data = as.matrix(train.treat),  # training data as matrix
                     label = train$y,                # column of outcomes
                     nrounds = ntrees,               # number of trees to build
                     objective = "reg:linear",       # objective
                     eta = 0.001,
                     depth = 10,
                     verbose = 0                     # silent
)
# Make predictions
test$pred <- predict(model_xgb, as.matrix(test.treat))
# Plot predictions vs actual outcomes
ggplot(test, aes(x = pred, y = y)) +
  geom_point() +
  geom_abline()
# Calculate RMSE
test %>%
  mutate(residuals = y - pred) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
How does this look?
Also, one thing I don't get about xgboost() is why I have to take out the dependent variable from the dataset in the "data" option and then add it back in the "label" option. Why do we do this?
My dataset has 809 observations and 108 independent variables. Here is an arbitrary subset:
structure(list(year = c(2019, 2019, 2019, 2019), ht = c(74, 76,
74, 73), wt = c(223, 234, 215, 215), age = c(36, 29, 32, 24),
gp_l1 = c(16, 16, 11, 14), gp_l2 = c(7, 0, 16, 0), gp_l3 = c(16,
15, 16, 0), gs_l1 = c(16, 16, 11, 13), gs_l2 = c(7, 0, 16,
0), gs_l3 = c(16, 15, 16, 0), cmp_l1 = c(372, 430, 226, 310
), cmp_l2 = c(154, 0, 297, 0), cmp_l3 = c(401, 346, 364,
0), att_l1 = c(597, 639, 365, 486), y = c(8, 71.5, 26, 22
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
My RMSE from this xgboost() model is 31.7, whereas my random forest and glmnet models give RMSEs around 13. The prediction metric I'm comparing to has an RMSE of 15.5. I don't get why my xgboost() model does so much worse than my random forest and glmnet models.
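On the side question about data and label: xgboost() trains on a plain numeric matrix, which has no notion of a formula or an outcome column, so the outcome has to be supplied separately. A hedged sketch of the equivalent xgb.DMatrix form, which bundles the two back together:
# xgb.DMatrix carries features and label in one object;
# "reg:squarederror" is the current name for the deprecated "reg:linear"
dtrain <- xgb.DMatrix(data = as.matrix(train.treat), label = train$y)
model <- xgb.train(params = list(objective = "reg:squarederror", eta = 0.1, max_depth = 6),
                   data = dtrain, nrounds = 25)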

Two different multiple GLM poisson model regression, mean points and confidence interval

I'd like to create a plot in ggplot2 that combines two different fitted multiple Poisson GLM models, mean points, and 95% confidence intervals. But my mean-point layer doesn't work.
#Artificial data set
Consumption <- c(501, 502, 503, 504, 26, 27, 55, 56, 68, 69, 72, 93)
Gender <- gl(n = 2, k = 6, length = 2*6, labels = c("Male", "Female"), ordered = FALSE)
Income <- c(5010, 5020, 5030, 5040, 260, 270, 550, 560, 680, 690, 720, 930)
df3 <- data.frame(Consumption, Gender, Income)
df3
# GLM Regression
fm1 <- glm(Consumption~Gender+Income, data=df3, family=poisson)
summary(fm1)
# ANOVA
anova(fm1, test = "Chi")
# The genders differ, so I fitted one model for males and another for females
#Male model
df4<-df3[df3$Gender=="Male",]
fm2 <- glm(Consumption~Income, data=df4, family=poisson)
summary(fm2)
#Female model
df5<-df3[df3$Gender=="Female",]
fm3 <- glm(Consumption~Income, data=df5, family=poisson)
summary(fm3)
# Create predictions and confidence intervals
Predictions <- c(predict(fm2, type = "link", se.fit = TRUE),
                 predict(fm3, type = "link", se.fit = TRUE))
df3_combined <- cbind(df3, Predictions)
df3_combined$UCL<-df3_combined$fit + 1.96*df3_combined$se.fit
df3_combined$LCL<-df3_combined$fit - 1.96*df3_combined$se.fit
df3_combined<-df3_combined[,-(6:9)]
df3_combined<-as.data.frame(df3_combined)
# Create mean values to plot
library(dplyr)
df <- df3_combined %>%
  group_by(Income, Gender) %>%
  summarize(Consumption = mean(Consumption, na.rm = TRUE))
df <- as.data.frame(df)
# Plot
library(tidyverse)  # loads ggplot2
df3_combined %>%
  gather(type, value, Consumption) %>%
  ggplot(mapping = aes(x = Income, y = Consumption, color = Gender)) +
  geom_point(df, mapping = aes(x = Income, y = Consumption, color = Gender)) +
  geom_line(mapping = aes(x = Income, y = exp(fit))) +
  geom_smooth(mapping = aes(ymin = exp(LCL), ymax = exp(UCL)), stat = "identity")
I don't see the mean values created in the df object in my output plot, and I don't know why.
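One thing worth checking (an editor's guess, not a verified fix): geom_point()'s first parameter is mapping, so df above is passed where an aes() is expected. Naming the argument removes the ambiguity:
# Pass the means with a named `data` argument
geom_point(data = df, mapping = aes(x = Income, y = Consumption, color = Gender))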

Dummy for lower tertile normalized by multiple variables

I need to find the proportion of respondents whose hand grip strength was in the bottom tertile normalised for age, gender, weight and height.
First, I am trying to normalize grip strength given age, male, height and weight. Secondly, I try to create a variable given the tertiles of this normalized grip strength variable. And lastly, I try to create a dummy equal to 1 if an individual is in the lower tertile.
So far I have constructed this code, but it is not really working:
df <- within(df, normal <- ave(grip, male, age, weight, height,
                               FUN = function(x) (x - min(x)) / diff(range(x))))
df$tertile <- ave(df$normal,
                  FUN = function(x) cut(x, labels = 1:3,
                                        breaks = quantile(x, probs = 0:3/3, na.rm = TRUE),
                                        include.lowest = TRUE))
df$lowgrip <- ifelse(df$tertile == 1, 1, 0)
Data could look like this:
set.seed(123)
df <- data.frame(
age = sample(50:79, 40, replace = TRUE),
male = sample(c("1", "0"), 40, replace = TRUE),
grip = sample(5:80, 40, replace = TRUE),
weight = sample(50:100, 40, replace = TRUE),
height = sample(150:200, 40, replace = TRUE)
)
However, my real data has around 7000 observations.
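For reference, a hedged sketch of one common way to do this kind of normalization, using residuals from a linear model rather than per-combination scaling (the ave() call above forms a group for every distinct male/age/weight/height combination, which with continuous covariates is almost always a single observation):
# Normalize grip as the residual from a regression on the covariates,
# then cut the residuals into tertiles (1 = lowest)
fit <- lm(grip ~ age + male + weight + height, data = df)
df$normal  <- residuals(fit)
df$tertile <- cut(df$normal,
                  breaks = quantile(df$normal, probs = 0:3/3, na.rm = TRUE),
                  labels = 1:3, include.lowest = TRUE)
df$lowgrip <- as.integer(df$tertile == 1)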
