How to plot different test conclusions - R

I'm trying to plot the different tests I ran on my database regarding the connections between different groups.
Here's the data frame structure:
Month District Age Gender Education Disability Religion Occupation JobSeekers GMI
1 2020-01 Dan U17 Male None None Jewish Unprofessional workers 2 0
2 2020-01 Dan U17 Male None None Muslims Sales and costumer service 1 0
3 2020-01 Dan U17 Female None None Other Undefined 1 0
4 2020-01 Dan 18-24 Male None None Jewish Production and construction 1 0
5 2020-01 Dan 18-24 Male None None Jewish Academic degree 1 0
6 2020-01 Dan 18-24 Male None None Jewish Practical engineers and technicians 1 0
ACU NACU NewSeekers NewFiredSeekers
1 0 2 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 0 1 0 0
6 0 1 1 1
I reduced it based on the relevant test; for example, for the t-test I did:
dist.newseek <- Cdata %>%
  group_by(Month, District) %>%
  summarise(NewSeekers = sum(NewSeekers))
Month District NewSeekers
<chr> <chr> <int>
1 2020-01 Dan 6551
2 2020-01 Jerusalem 3589
3 2020-01 North 6154
4 2020-01 Sharon 4131
5 2020-01 South 4469
6 2020-02 Dan 5529
and then performed a t-test:
t.test(NewSeekers ~ District,data=subset(dist.newseek,District %in% c("Dan","South")))
Here are all the tests I did for each group (a t-test for new seekers vs. district, a Wilcoxon test for new seekers vs. age, and an ANOVA for new seekers vs. occupation).
I'm looking for a graphical way to show the result of each test.
If you have any ideas, please help.
# t-test for district vs new seekers
# aggregating NewSeekers by Month and District
dist.newseek <- Cdata %>%
  group_by(Month, District) %>%
  summarise(NewSeekers = sum(NewSeekers))
# performing a t test on the mini table we created
t.test(NewSeekers ~ District,data=subset(dist.newseek,District %in% c("Dan","South")))
# results
Welch Two Sample t-test
data: NewSeekers by District
t = 0.68883, df = 4.1617, p-value = 0.5274
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-119952.3 200737.3
sample estimates:
mean in group Dan mean in group South
74608.25 34215.75
# Wilcoxon test
# aggregating NewSeekers by Month and Age
age.newseek <- Cdata %>%
  group_by(Month, Age) %>%
  summarise(NewSeekers = sum(NewSeekers))
# performing a Wilcoxon test on the subset
wilcox.test(NewSeekers ~ Age,data=subset(age.newseek,Age %in% c("25-34","45-54")))
# Results
Wilcoxon rank sum exact test
data: NewSeekers by Age
W = 11, p-value = 0.4857
alternative hypothesis: true location shift is not equal to 0
# ANOVA test
# aggregating NewSeekers by Month and Occupation
occu.newseek <- Cdata %>%
  group_by(Month, Occupation) %>%
  summarise(NewSeekers = sum(NewSeekers))
## Make Occupation a factor
occu.newseek$Occupation <- as.factor(occu.newseek$Occupation)
## Get the occupation group means and standard deviations
group.mean.sd <- aggregate(
  x = occu.newseek$NewSeekers,         # specify data column
  by = list(occu.newseek$Occupation),  # specify group indicator
  FUN = function(x) c('mean' = mean(x), 'sd' = sd(x))
)
## Run one way ANOVA test
anova_one_way <- aov(NewSeekers~ Occupation, data = occu.newseek)
summary(anova_one_way)
## Run the Tukey Test to compare the groups
TukeyHSD(anova_one_way)
## Check the mean differences across the groups
library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = Occupation)) +
  geom_boxplot() +
  geom_jitter(shape = 15,
              color = "steelblue",
              position = position_jitter(0.21)) +
  theme_classic()
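Ideally I'd like something along these lines: the comparison plot annotated with the test result itself. Here is a rough sketch of what I mean for the district t-test (reusing dist.newseek from above, so treat it as an illustration rather than finished code):
# sketch: boxplot of NewSeekers by District with the Welch t-test p-value printed on it
dan.south <- subset(dist.newseek, District %in% c("Dan", "South"))
tt <- t.test(NewSeekers ~ District, data = dan.south)
ggplot(dan.south, aes(x = District, y = NewSeekers, fill = District)) +
  geom_boxplot() +
  annotate("text", x = 1.5, y = max(dan.south$NewSeekers) * 1.05,
           label = paste0("Welch t-test, p = ", signif(tt$p.value, 3))) +
  theme_classic()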
Thanks,
Moshe

Related

Panel regression - Estimators

I am trying to do a panel regression in R.
pdata <- pdata.frame(NEW, index = c("Year"))
And:
R1 <- plm(Market_Cap ~ GDP_growthR + Volatility_IR + FDI
+ Savings_rate, data=pdata, model="between")
However, when I want to use the within (or random) estimator, I get the following error:
Error in plm.fit(data, model, effect, random.method, random.models, random.dfcor, : empty model
But, when I use the between estimator, everything is fine. Do you have any explanation and suggestion?
Thank you!
You should heed the advice in the comments.
I addressed a version of the OP's question on CV. If the structure of the data is the same, then you're only observing one cross-sectional unit over time. In your setting, you're observing a single country over many years. If your data was a true panel dataset, you would be observing more than one country over at least two years. For example, I will simulate a small panel data frame.
library(dplyr)
library(plm)
set.seed(12345)
panel <- tibble(
  country = c(rep("Spain", 5), rep("France", 5), rep("Croatia", 5)),
  year = rep(2016:2020, 3),               # each country is observed over 5 years
  x = rnorm(15),                          # sample 15 random deviates (5 per country)
  y = sample(c(10000:100000), size = 15)  # sample incomes (range: 10,000 - 100,000)
) %>%
  mutate(
    France = ifelse(country == "France", 1, 0),
    Croatia = ifelse(country == "Croatia", 1, 0),
    y_2016 = ifelse(year == 2016, 1, 0),
    y_2017 = ifelse(year == 2017, 1, 0),
    y_2018 = ifelse(year == 2018, 1, 0),
    y_2019 = ifelse(year == 2019, 1, 0),
    y_2020 = ifelse(year == 2020, 1, 0)
  )
Inside of the mutate() function I appended dummies for all countries and all years, excluding one country and one year. In your other question, you estimate time fixed effects. Software invariably drops one year to avoid collinearity. You don't need to append the dummies, but they are helpful for explication purposes. Here is a classic panel data frame:
# Panel - varies across two dimensions (country + time)
# 3 countries observed over 5 years for a total of 15 country-year observations
# A tibble: 15 x 10
country year x y France Croatia y_2017 y_2018 y_2019 y_2020
<chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Spain 2016 0.586 81371 0 0 0 0 0 0
2 Spain 2017 0.709 10538 0 0 1 0 0 0
3 Spain 2018 -0.109 26893 0 0 0 1 0 0
4 Spain 2019 -0.453 71363 0 0 0 0 1 0
5 Spain 2020 0.606 43308 0 0 0 0 0 1
6 France 2016 -1.82 42544 1 0 0 0 0 0
7 France 2017 0.630 88187 1 0 1 0 0 0
8 France 2018 -0.276 91368 1 0 0 1 0 0
9 France 2019 -0.284 65563 1 0 0 0 1 0
10 France 2020 -0.919 22061 1 0 0 0 0 1
11 Croatia 2016 -0.116 80390 0 1 0 0 0 0
12 Croatia 2017 1.82 48623 0 1 1 0 0 0
13 Croatia 2018 0.371 93444 0 1 0 1 0 0
14 Croatia 2019 0.520 79582 0 1 0 0 1 0
15 Croatia 2020 -0.751 33367 0 1 0 0 0 1
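As a quick aside on the collinearity point above (my own illustration, not part of the original answer), you can see R drop one year automatically when the dummies are generated from a factor:
# baseline dropping illustrated with the simulated panel from above
coef(lm(y ~ x + factor(year), data = panel))
# coefficients appear for factor(year)2017 ... factor(year)2020 only; 2016 is absorbed into the intercept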
As @DaveArmstrong correctly noted, you should specify the panel indexes. First, we specify a panel data frame, then we estimate the model.
pdata <- pdata.frame(panel, index = c("year", "country"))
random <- plm(y ~ x, model = "random", data = pdata)
A one-way random effects model is fit. The call to summary() will produce the following (abridged output):
Call:
plm(formula = y ~ x, data = pdata, model = "random")
Balanced Panel: n = 5, T = 3, N = 15
Effects:
var std.dev share
idiosyncratic 685439601 26181 0.819
individual 151803385 12321 0.181
theta: 0.2249
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-49380 -17266 6221 17759 32442
Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 58308.0 8653.7 6.7380 1.606e-11 ***
x 7777.0 8808.9 0.8829 0.3773
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
But your data does not have this structure, hence the error message. In fact, your data is similar to carving out one country from this panel. For example, suppose we winnowed the data frame down to Croatian observations only. The following code takes a subset of the previous data frame:
croatia_only <- panel %>%
filter(country == "Croatia") # grab only the observations from Croatia
Here, longitudinal variation only exists for one country. In other words, by restricting attention to Croatia, we cannot exploit the variation across countries; we only have variation in one dimension! The resulting data frame looks like the following:
# Time Series - varies across one dimension (time)
# A tibble: 5 x 10
country year x y France Croatia y_2017 y_2018 y_2019 y_2020
<chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Croatia 2016 -0.116 80390 0 1 0 0 0 0
2 Croatia 2017 1.82 48623 0 1 1 0 0 0
3 Croatia 2018 0.371 93444 0 1 0 1 0 0
4 Croatia 2019 0.520 79582 0 1 0 0 1 0
5 Croatia 2020 -0.751 33367 0 1 0 0 0 1
Now I will re-estimate a random effects model with one country:
pdata <- pdata.frame(croatia_only, index = c("year", "country"))
random_croatia <- plm(y ~ x , model = "random", data = pdata)
This should reproduce your error message (i.e., empty model). Note, you only have variation within one country!
As you correctly noted, a "between-effects" model is estimable, but not for the reasons you might presume. A "between effects" model averages over all years within a country, then runs ordinary least squares on the 'averaged' data. In your setting, taking the average over your time series results in a country mean. And since you only observe one country, you only have one observation. Such a model is inestimable.
However, you can 'pool' together all of your yearly observations for one country and run a linear model instead. That is what you're doing. To test this out using one country, try comparing the "between" model with the "pooling" model. They should produce identical estimates of x.
# Run this using the croatia_only data frame
summary(plm(y ~ x , model = "between", data = pdata))
summary(plm(y ~ x , model = "pooling", data = pdata))
It should be painfully obvious now, but model = "pooling" is equivalent to running lm().
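A quick way to convince yourself (my own check, assuming the croatia_only data frame and pdata from above):
pooled <- plm(y ~ x, model = "pooling", data = pdata)
ols    <- lm(y ~ x, data = croatia_only)
all.equal(unname(coef(pooled)), unname(coef(ols)))  # should return TRUE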
If you want me to tie this into your previous post, try estimating a linear model with separate dummies for all years as covariates. You will quickly discover that you have no residual degrees of freedom, which is exactly the problem outlined in your other post.
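For concreteness, here is a small sketch of that degrees-of-freedom problem using the croatia_only data from above: 5 observations, but an intercept, x, and four year dummies add up to 6 parameters.
no_df <- lm(y ~ x + factor(year), data = croatia_only)
summary(no_df)  # one coefficient comes back NA and there are 0 residual degrees of freedom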
In sum, I would look for data from other countries. Once you do that, you can use plm() for all it's worth.

How to convert the values of a column of a data frame to 0s and 1s?

I am working on a negative binomial regression model, which predicts the number of initiated private member bills by MPs in Japan based on their voteshare, age, sex and parliamentary office. In order to calculate the AME for the parliamentary_office variable I need to create two new data frames df.0 and df.1. As an example, here's the data frame df.0:
(Intercept) voteshare age sexmale term parliamentary_office
1 1 37.92 57 male 0 0
2 1 45.99 65 male 5 0
3 1 36.18 59 female 3 0
4 1 43.3 47 male 1 0
5 1 45.48 58 male 5 0
6 1 31.89 44 male 0 0
How do I convert the sexmale column to numbers?
Here is my code:
#rm(list=ls())
library(foreign)
dat <- read.dta(file = 'activity.dta', convert.factors = FALSE)
dat_clear <- na.omit(dat)
datc_2012 <- dat_clear[dat_clear$election == 2012, ]
library(MASS)
summary(m2.negbin <- glm.nb(num_pmbs_initiated ~
voteshare + age + sex + term
+ parliamentary_office, data = datc_2012,
link = "log"))
df.0 <- data.frame(cbind(1,
                         m2.negbin[["model"]]$voteshare,
                         m2.negbin[["model"]]$age,
                         m2.negbin[["model"]]$sex,
                         m2.negbin[["model"]]$term,
                         m2.negbin[["model"]]$parliamentary_office))
colnames(df.0) <- names(coef(m2.negbin))
df.1 <- df.0
df.0[,"parliamentary_office"] <- 0
df.1[,"parliamentary_office"] <- 1
You can use the ifelse() function if you only have male and female and need to change the column into 0/1 values. ifelse() is already vectorised, so sapply() is not needed, and coding male as 1 keeps the column consistent with the sexmale coefficient from the model:
df.0$sexmale <- ifelse(df.0$sexmale == "male", 1, 0)
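For completeness, a sketch of how df.0 and df.1 are then typically used for the AME (my own illustration, not part of the original answer; it assumes df.1$sexmale has been recoded to 0/1 in the same way):
b  <- coef(m2.negbin)
X0 <- sapply(df.0, function(col) as.numeric(as.character(col)))  # coerce every column to numeric
X1 <- sapply(df.1, function(col) as.numeric(as.character(col)))
mean(exp(X1 %*% b) - exp(X0 %*% b))  # log link, so exp() gives predicted counts; the mean difference is the AME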

Average Marginal Effects in R with complex interaction terms

I am using R to compute the linear regression on the following model, as well as find the marginal effects of age on pizza at specific points (20,30,40,50,55).
mod6.22c <- lm(pizza ~ age + income + age*income +
I((age*age)*income), data = piz4)
The problem I am running into is that, when using the margins command, R does not see interaction terms that are inserted into the lm with I((age*age)*income). The margins command only produces accurate average marginal effects when the interaction terms are written directly in the formula (e.g., variable1*variable2), not wrapped in I(). I also can't create a new variable in my table with table$newvariable <- table$variable1^2, because the margins command won't identify newvariable as related to variable1.
This has been fine up until now, while my interaction terms were only a quadratic or an x*y interaction, but now I need to calculate the average marginal effects with the interaction term age^2 * income included in the model. The only way I can seem to get the summary lm output to be correct is by using I(age^2*(income)) or by creating a new variable in my table. As stated before, the margins command can't read I(age^2*(income)), and if I create a new variable, the margins command doesn't recognize that the variables are related, and the average marginal effects produced are incorrect.
The error I am receiving:
> summary(margins(mod6.22c, at = list(age= c(20,30,40,50,55)),
variables = "income"))
Error in names(classes) <- clean_terms(names(classes)) :
'names' attribute [4] must be the same length as the vector [3]
I appreciate any help in advance.
Summary of data:
Pizza is annual expenditure on pizza, female, hs, college and grad are dummy variables, income is in thousands of dollars per year, age is years old.
> head(piz4)
pizza female hs college grad income age agesq
1 109 1 0 0 0 19.5 25 625
2 0 1 0 0 0 39.0 45 2025
3 0 1 0 0 0 15.6 20 400
4 108 1 0 0 0 26.0 28 784
5 220 1 1 0 0 19.5 25 625
6 189 1 1 0 0 39.0 35 1225
Libraries used:
library(data.table)
library(dplyr)
library(margins)
tldr
This works:
mod6.22 <- lm(pizza ~ age + income + age*income, data = piz4)
summary(margins(mod6.22, at = list(age= c(20,30,40,50,55)), variables = "income"))
factor age AME SE z p lower upper
income 20.0000 4.5151 1.5204 2.9697 0.0030 1.5352 7.4950
income 30.0000 3.2827 0.9049 3.6276 0.0003 1.5091 5.0563
income 40.0000 2.0503 0.4651 4.4087 0.0000 1.1388 2.9618
income 50.0000 0.8179 0.7100 1.1520 0.2493 -0.5736 2.2095
income 55.0000 0.2017 0.9909 0.2036 0.8387 -1.7403 2.1438
This doesn't work:
mod6.22c <- lm(pizza ~ age + income + age*income + I((age * age)*income), data = piz4)
summary(margins(mod6.22c, at = list(age= c(20,30,40,50,55)), variables = "income"))
Error in names(classes) <- clean_terms(names(classes)) :
'names' attribute [4] must be the same length as the vector [3]
How do I get margins to read my interaction variable I((age*age)*income)?
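One workaround worth sketching here (my own suggestion, not from the thread): because the model is linear in income, the marginal effect of income at a given age can be checked by hand with a finite difference on predict(), without going through margins at all. A minimal sketch, assuming piz4 and mod6.22c as defined above:
# finite-difference marginal effect of income at selected ages;
# the result does not depend on the income value used because the model is linear in income
ages <- c(20, 30, 40, 50, 55)
h <- 0.001
hi <- data.frame(age = ages, income = mean(piz4$income) + h)
lo <- data.frame(age = ages, income = mean(piz4$income) - h)
(predict(mod6.22c, newdata = hi) - predict(mod6.22c, newdata = lo)) / (2 * h)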

How to account for categorical variables in calculating a risk score from regression model?

I have a data set which has a number of variables that I'd like to use to generate a risk score for getting a disease.
I have created a basic version of what I'm trying to do.
The dataset looks like this:
ID DISEASE_STATUS AGE SEX LOCATION
1 1 20 1 FRANCE
2 0 22 1 GERMANY
3 0 24 0 ITALY
4 1 20 1 GERMANY
5 1 20 0 ITALY
So the model I ran was:
glm(disease_status ~ age + sex + location, data=data, family=binomial(link='logit'))
The beta values produced by this model were as follows:
bage = −0.193
bsex = −0.0497
blocation= 1.344
To produce a risk score, I want to multiply the values for each individual by the beta values, eg:
risk score = (-0.193 * 20 (age)) + (-0.0497 * 1 (sex)) + (1.344 * ??? (location))
However, what value would I use to multiply the beta score for location by?
Thank you!
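One way to see what goes in for location (a sketch of my own, not from the thread): R expands a categorical predictor into one 0/1 dummy per non-reference level, so you multiply each location coefficient by that indicator. Letting R build the design matrix makes this explicit; the sketch below assumes a data frame named data with columns disease_status, age, sex and location, as in the glm() call above:
fit <- glm(disease_status ~ age + sex + location, data = data, family = binomial(link = 'logit'))
X <- model.matrix(fit)                # columns: (Intercept), age, sex, and a 0/1 dummy for each non-reference location
score <- as.vector(X %*% coef(fit))   # the linear predictor (log-odds); same as predict(fit, type = "link")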

calculate gender percentage from grouped data frame in R

I have a fairly large data frame that includes information on individuals divided into treatment groups. I am trying to generate variable means and gender percentages per group. I was able to calculate the means, but I am not sure how to get the gender percentages.
Below, I generated a small replica of what my data looks like:
library(plyr)
#create variables and data frame
sampleid<-seq(1:100)
gender = rep(c("female","male"),c(50,50))
score <- rnorm(100)
age<-sample(25:35,100,replace=TRUE)
treatment <- rep(seq(1:5), each=4)
d <- data.frame(sampleid,gender,age,score, treatment)
>head(d)
sampleid gender age score treatment
1 1 female 34 1.6917201 1
2 2 female 26 -1.6189545 1
3 3 female 28 1.2867895 1
4 4 female 34 -0.5027578 1
5 5 female 29 -1.3652895 2
6 6 female 26 -2.4430843 2
I obtain the mean of each numeric column by:
groupstat<-ddply(d, .(treatment),numcolwise(mean))
which gives:
treatment sampleid age score
1 1 42.5 29.15 0.142078574
2 2 46.5 29.50 -0.261492514
3 3 50.5 30.50 -0.188393235
4 4 54.5 30.45 0.003526078
5 5 58.5 30.55 0.062996737
However I also need an additional column "Percent Female", which should give me the percentage of females within each treatment group 1:5.
Can someone help me in how to add this?
Try this out
groupstat <- ddply(d, .(treatment), summarise,
                   meansc = mean(score),
                   meanage = mean(age),
                   meanID = mean(sampleid),
                   nfem = length(gender[gender == "female"]),  # number of females per treatment group
                   nmale = length(gender[gender == "male"]),   # number of males per treatment group
                   percentfem = nfem/(nfem + nmale))           # share of females by treatment group
I would first split the data into treatment groups (split(d, f = d$treatment)) and then calculate the share of females in each group (function(x) sum(x$gender == "female")/length(x$gender)):
sapply(split(d, f = d$treatment), function(x) sum(x$gender == "female")/length(x$gender))
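If you want that as an extra "Percent Female" column on the summary table, here is a small sketch combining the two approaches above (assumes groupstat and d as defined earlier, with the groups in the same 1-5 order):
groupstat$percentfem <- 100 * sapply(split(d, f = d$treatment),
                                     function(x) mean(x$gender == "female"))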
