I would like to run a regression to find the relationship between ln_wage and education level.
The education levels are used as dummy variables. The data looks like this:
unique(data1$grade)
'MiddleSchool' 'University' 'Primary' 'Vocational' 'HighSchool' 'PostSecondary'
At first, the reference group is HighSchool, as I checked with contrasts(), which returns:
data1$grade = as.factor(data1$grade)
contrasts(data1$grade)
              MiddleSchool PostSecondary Primary University Vocational
HighSchool               0             0       0          0          0
MiddleSchool             1             0       0          0          0
PostSecondary            0             1       0          0          0
Primary                  0             0       1          0          0
University               0             0       0          1          0
Vocational               0             0       0          0          1
Then I run the regression.
olsreg <- lm(ln_wage ~ age + age_sqr + grade + indus + year -1, data=data1)
summary(olsreg)$coefficient
                   Estimate   Std. Error
gradePrimary          2.561 1.127378e-02
gradeMiddleSchool     2.680 1.108296e-02
gradeHighSchool       2.743 1.128950e-02
gradeVocational       2.820 1.165518e-02
gradePostSecondary    2.915 1.147913e-02
gradeUniversity       3.191 1.136915e-02
The results show every category; none is omitted as the reference group.
Then I use relevel() to change the reference group to Primary.
data1$grade = relevel(data1$grade, ref='Primary')
contrasts(data1$grade)
              HighSchool MiddleSchool PostSecondary University Vocational
Primary                0            0             0          0          0
HighSchool             1            0             0          0          0
MiddleSchool           0            1             0          0          0
PostSecondary          0            0             1          0          0
University             0            0             0          1          0
Vocational             0            0             0          0          1
The reference group changed to Primary.
Then I run the regression again.
olsreg <- lm(ln_wage ~ age + age_sqr + grade + indus + year -1, data=data1)
summary(olsreg)$coefficient
                   Estimate   Std. Error
gradePrimary          2.561 1.127378e-02
gradeMiddleSchool     2.680 1.108296e-02
gradeHighSchool       2.743 1.128950e-02
gradeVocational       2.820 1.165518e-02
gradePostSecondary    2.915 1.147913e-02
gradeUniversity       3.191 1.136915e-02
The coefficients do not change at all from the previous regression, and there also seems to be no baseline.
What did I do wrong here? Any advice is very much appreciated.
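A likely explanation, sketched on made-up data: the `- 1` in the formula removes the intercept, and when there is no intercept R codes the first factor in the formula with one indicator column per level. No level is dropped, so relevel() has no visible effect on the coefficients. The data frame `toy` and its values below are hypothetical; only the formula pattern mirrors the question.

```r
# Toy data (hypothetical): three education levels with different mean log wages.
set.seed(1)
toy <- data.frame(
  grade = factor(rep(c("HighSchool", "Primary", "University"), each = 20)),
  age   = rep(seq(20, 58, by = 2), times = 3)
)
toy$ln_wage <- 2.5 + 0.2 * (toy$grade == "HighSchool") +
  0.6 * (toy$grade == "University") + 0.01 * toy$age + rnorm(60, sd = 0.1)

# Without the intercept, the first factor in the formula gets one column per level.
no_int <- lm(ln_wage ~ age + grade - 1, data = toy)
names(coef(no_int))

# With the intercept, the first level (the reference) is dropped.
toy$grade <- relevel(toy$grade, ref = "Primary")
with_int <- lm(ln_wage ~ age + grade, data = toy)
names(coef(with_int))
```

Both fits describe the same model. In the no-intercept version each grade coefficient is that level's own intercept, and the difference between two of them equals the corresponding coefficient in the version with an intercept.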
I am writing a paper on the gender pay gap in Lithuania, and my goal is to interpret statistical survey data, determining the factors partially explaining the wage gap (such as age, tenure, education, etc.), using the Oaxaca-Blinder decomposition.
I have very little knowledge of 'R', although in University I did have some classes, mostly about linear regression models. Please excuse if my questions are not well-formulated. Any comments and advice will be greatly appreciated.
I came across the 'Oaxaca' package for 'R', but have not been able to fully adapt the 'formula' function to my data. The instructions of the package:
https://cran.r-project.org/web/packages/oaxaca/oaxaca.pdf
My problem is not understanding how to properly use the 'formula' function for my data, which contains a lot of non-numeric variables that I tried to turn into indicator ("dummy") variables with values of "0" or "1".
Specifically, I cannot adjust the formula to make the result invariant to the selected reference category. When I try to do this, I get the error message: "Variables d1 + d2 + d3 + ... in argument 'formula' must indicate membership in mutually exclusive categories."
The 'Oaxaca' formula that more or less works for me looks like this:
1) y ~ x1 + x2 + x3 + ... | z
Here y is the dependent variable, x1 + x2 + x3 + ... are explanatory
variables and z is an indicator variable that states whether an observation belongs to Group B (female) or group A (male).
The formula adjusted for reference category:
2) y ~ x1 + x2 + x3 + ... | z | d1 + d2 + d3 + ...
Here, d1 + d2 + d3 + ... are indicator ("dummy") variables that will
be adjusted so that the decomposition results do not change depending on the
user’s choice of the reference category (Gardeazabal and Ugidos, 2004).
I cannot run formula 2), but I can run formula 1) when I delete a couple of dummy variables, otherwise I get an error.
I have 5 levels (separate variables) for Age (1st - 14 to 19, 2nd - 20 to 29, 3rd - 30 to 39, etc.), 4 levels for tenure (1st - 0 to 2, 2nd - 2 to 4 years), 15 levels for Industry, 63 levels for Occupation, etc. I am going to call Age, Tenure, Industry and Occupation my different 'types' that should each have their own reference category omitted from the formula.
Since I use a lot of 'types' of indicator variables, what I don't understand is, how does 'R' recognize which reference category belongs to which 'type'?
Maybe 'R' reads all "dummy" variables as levels of the same 'type', and selects only 1 omitted variable as reference category for all the variables?
Is there any way that you know in which I could adapt my data to specify the correct reference category for each 'type'? Judging by the example with 'Chicago' dataframe it seems like I have too many different 'types' of variables for this formula to work.
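On the question of how R matches a reference category to a 'type': with hand-made 0/1 columns, a plain regression does not know about 'types' at all. Each column is just a numeric regressor, and the column left out of each set acts as that set's reference. The sketch below uses made-up data with two sets (`age1` to `age3`, `edu1` to `edu3`); the names and values are hypothetical, and it shows lm() rather than the oaxaca function.

```r
# Toy data (hypothetical): two independent dummy sets plus an outcome.
set.seed(2)
n <- 200
age <- sample(1:3, n, replace = TRUE)
edu <- sample(1:3, n, replace = TRUE)
toy <- data.frame(
  y    = 1 + 0.3 * (age == 2) + 0.5 * (age == 3) + 0.4 * (edu == 3) + rnorm(n, sd = 0.1),
  age1 = as.integer(age == 1), age2 = as.integer(age == 2), age3 = as.integer(age == 3),
  edu1 = as.integer(edu == 1), edu2 = as.integer(edu == 2), edu3 = as.integer(edu == 3)
)

# One dummy left out per set: age1 and edu1 act as the two reference categories.
ok <- lm(y ~ age2 + age3 + edu2 + edu3, data = toy)
coef(ok)

# All dummies included: each set sums to the intercept column, so lm() reports
# NA for one coefficient in each set.
full <- lm(y ~ age1 + age2 + age3 + edu1 + edu2 + edu3, data = toy)
names(coef(full))[is.na(coef(full))]
```

So leaving out one dummy per set by hand, as in formula 1), gives each set its own baseline.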
The original data I have is from the Lithuanian Structure of Earnings Survey 2014. I have created new data in Excel (later converted to a .csv file) using the original, following the example of the 'Chicago' dataframe, used in the 'Oaxaca' package example. The data created is mostly made of dummy variables with the values of "0" or "1", except for the Hours column, which contains hours worked in a month, and the log.wage column, containing the natural logarithm of the hourly wage. Everything else is indicator variables. However, these indicator variables belong to different types, as mentioned already, such as Age, Tenure, etc.
I have been unsuccessful in trying to manipulate the original dataset to create indicator variables using 'R', because I need to create specific new variables from a variety of the existing ones, for example all the occupations coded 431 and 432 should be merged into 1 variable titled 'prof43'. I have not found out how to do this so far.
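Merging codes into one indicator can be done in base R with `%in%`. This is a minimal sketch on made-up data: the column name `occ` and the code values are hypothetical, and grouping by the first two digits is only an assumption based on the 431/432 to 'prof43' example.

```r
# Hypothetical raw data: `occ` holds three-digit occupation codes.
raw <- data.frame(occ = c(431, 432, 511, 432, 962))

# prof43 is 1 when the code is 431 or 432, and 0 otherwise.
raw$prof43 <- as.integer(raw$occ %in% c(431, 432))

# Alternative: group by the first two digits and let R build every dummy at once.
raw$group <- factor(paste0("prof", raw$occ %/% 10))
dummies <- model.matrix(~ group - 1, data = raw)
colnames(dummies) <- levels(raw$group)
dummies
```

The first form is easier to read for a handful of variables. The second avoids writing 63 separate lines for the occupation groups.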
My data contains mostly indicator variables and the variable types look like this:
str(S14)
'data.frame': 44952 obs. of 71 variables:
$ hours : int 1 1 1 1 2 1 1 2 1 1 ...
$ female : int 0 1 1 1 0 0 0 1 0 0 ...
$ age0 : int 0 0 0 0 0 0 0 0 0 0 ...
$ age1 : int 1 1 0 0 0 0 0 1 1 0 ...
$ age2 : int 0 0 0 1 0 1 0 0 0 0 ...
$ age3 : int 0 0 1 0 1 0 0 0 0 1 ...
$ age4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ age5 : int 0 0 0 0 0 0 1 0 0 0 ...
$ prof11 : int 0 0 0 0 0 0 0 0 0 0 ...
......
$ prof96 : int 0 0 0 0 1 0 0 0 0 0 ...
$ edu1 : int 0 0 0 0 0 0 0 0 1 0 ...
$ edu2 : int 0 1 0 0 1 1 0 1 0 1 ...
$ edu3 : int 1 0 1 1 0 0 1 0 0 0 ...
$ ten1 : int 1 1 1 1 1 1 1 1 1 1 ...
$ ten2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ ten3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ ten4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ size1to50: int 1 1 0 1 1 1 0 1 1 1 ...
$ nace1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ nace2 : int 0 0 0 0 0 0 0 0 0 0 ...
......
$ nace15 : int 0 0 0 0 0 0 0 0 0 0 ...
$ pubcon : int 0 0 0 0 0 0 0 0 0 0 ...
$ temp : int 0 0 0 0 0 0 0 0 0 0 ...
$ log.wage : num 1.79 1.79 1.79 1.79 1.79 ...
I run the 'Oaxaca' function using this code:
library(oaxaca)
set.seed(03104) #random seed
I get results from this, yet I doubt their validity due to the fact that I delete 1 non-zero indicator variable (prof 62) (otherwise it doesn't run):
results0 <- oaxaca(log.wage ~ hours + pubcon + temp + size1to50 + age0 + age1 + age2 +
age4 + age5 + ten1 + ten2 + ten4 + edu1 + edu3 + prof11 + prof12 + ..... +
prof96 + nace1 + nace2 + ... + nace14 | female, data = S14, R = 30)
# 1) y ~ x1 + x2 + x3 + ... | z
The code that gets the error message for me:
results1 <- oaxaca(log.wage ~ hours + pubcon + temp + size1to50 +
age0 + age1 + age2 + age4 + age5 + ten1 + ten2 + ten4 + edu1 + edu3 +
prof11 + prof12 + ..... + prof96 + nace1 + nace2 + ... + nace14 | female |
pubcon + temp + size1to50 + age0 + age1 + age2 + age4 + age5 + ten1 + ten2 +
ten4 + edu1 + edu3 + prof11 + prof12 + ..... + prof96 + nace1 + nace2 + ... + nace14,
data = S14, R = 30) # 2) y ~ x1 + x2 + x3 + ... | z | d1 + d2 + d3 + ...
Running this, I get the error message:
Variables d1 + d2 + d3 + ... in argument 'formula' must indicate membership in mutually exclusive categories.
Does anyone have any suggestions?
Do you think using the original dataset and sorting it into indicator variables using 'R' would work, and I could select the reference category which the function 'formula' would recognize?
If so, what package and formulas do you suggest using to adapt my data?
Or do you think I am using too many variables for this 'Oaxaca' package and I should restrict my data?
Also, do the results I get with formula 1) make sense? I am worried that 'R' does not choose the correct reference category for each 'type' of variable set, resulting in all the indicator variables being dependent on some random omitted variable, which would make the results nonsensical.
Excuse my lengthy ramblings. I hope I made some sense, and if anyone has experience working with the 'Oaxaca' package or any ideas on what to do here, I am extremely grateful in advance!
I wrote the creator of the package and sent him a link to this page since I was having the same problem. Here is what he responded: "It looks like you are trying to include several (independent) sets of dummy variables, hence the error message. The oaxaca package, unfortunately, does not support this."
For what it's worth, it appears that oaxaca in Stata does support this, if you are looking for an alternative.
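To illustrate why several independent sets cannot pass as "mutually exclusive categories": within one set every row has a single 1, but once two sets are pooled every row has a 1 in each of them. The check below is only an illustration on made-up data, not the package's internal test.

```r
# Toy stand-in for S14 (hypothetical values), with two independent dummy sets.
toy <- data.frame(
  age1 = c(1, 0, 0, 1), age2 = c(0, 1, 0, 0), age3 = c(0, 0, 1, 0),
  edu1 = c(0, 0, 1, 0), edu2 = c(1, 0, 0, 1), edu3 = c(0, 1, 0, 0)
)

# Within one set, each row has exactly one 1, so the set is mutually exclusive.
age_cols <- grep("^age", names(toy), value = TRUE)
all(rowSums(toy[age_cols]) == 1)

# Pooling two sets breaks exclusivity: every row now has two 1s.
pooled <- c(age_cols, grep("^edu", names(toy), value = TRUE))
all(rowSums(toy[pooled]) <= 1)
```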
I have multiple-response questions with 5 categories (values). I want to find the respondents who answered only one category.
For example,
respondents who chose category 1 and none of categories 2, 3, 4, 5.
I want only the respondents who checked category A alone, and I need a count of them.
Help, please.
The following solution assumes the data has 5 dichotomous variables, one for each of the multiple response categories.
* creating some sample data to demonstrate on.
data list list/cat1 to cat5.
begin data
1 0 0 0 1
0 1 1 0 0
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 0 1
1 0 0 0 0
1 1 1 0 0
end data.
* now checking in which cases only category 1 was chosen.
compute NumCats=sum(cat1 to cat5).
if cat1=1 and NumCats=1 onlyCat1=1.
execute.
* if instead you wish to do the same check for each of the 5 categories,
use `do repeat` this way.
do repeat cat=cat1 to cat5/only=only1 to only5.
compute only=(cat=1 and NumCats=1).
end repeat.
execute.
But ditch the EXECUTE commands. They just cause a useless data pass in this case except for immediately updating the Data Editor (instead of updating on the next data pass).
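For readers working in R rather than SPSS, the same check and the requested counts can be sketched as follows. The data frame `resp` reproduces the sample data from the syntax above; the object names are made up for this example.

```r
# Same sample data as in the SPSS example, one column per category.
resp <- data.frame(
  cat1 = c(1, 0, 1, 0, 0, 0, 1, 1),
  cat2 = c(0, 1, 0, 1, 0, 0, 0, 1),
  cat3 = c(0, 1, 0, 0, 1, 0, 0, 1),
  cat4 = c(0, 0, 0, 0, 0, 0, 0, 0),
  cat5 = c(1, 0, 0, 0, 0, 1, 0, 0)
)

# Number of categories each respondent ticked.
num_cats <- rowSums(resp)

# For each category, count respondents who ticked it and nothing else.
only_counts <- colSums(resp == 1 & num_cats == 1)
only_counts
```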
I have an input UFF file with 'n' channels. I want to read the UFF file and split the values by individual channel, then store the result for each channel in a separate file. Each channel always starts with '-1', '58', etc., and ends with '-1'.
Example channel_01 from the input UFF file:
-1
58
filename
22-Mar-2016 10:16:53
164
MnBrgFr-AC225R/N;50.9683995923 mV/m/s2
0 0 0 0 channel_01 0 0 NONE 0 0
2 1048576 1 0.00000E+00 8.19669930804e-06 0.00000E+00
17 0 0 0 Time s
1 0 0 0 MnBrgFr-AC225R/N m/s2
0 0 0 0 NONE NONE
0 0 0 0 NONE NONE
392.665124452 392.659048025 392.658404832 392.661676933 392.665882251 392.671989083
392.67634175 392.673743248 392.672398388 392.669360175 392.665533757 392.66088639
392.660390546 392.660975268 392.663400693 392.662668621 392.661209156 392.65498538
392.649463269 392.649580214 392.649259786 392.658580248 392.664715147 392.667051694
-1
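One possible approach, sketched in R under the assumption that a line containing only -1 (ignoring surrounding spaces) never occurs inside the data values: find the delimiter lines, pair them up as start and end, and write each block to its own file. The function name `split_uff` and the output file names are made up for this example.

```r
# Split a UFF file into one file per dataset. A line holding only "-1"
# opens each dataset and another one closes it.
split_uff <- function(path, out_dir = tempdir()) {
  lines <- readLines(path)
  marks <- which(trimws(lines) == "-1")
  stopifnot(length(marks) %% 2 == 0)   # delimiters must come in pairs
  starts <- marks[c(TRUE, FALSE)]      # 1st, 3rd, 5th, ... delimiter
  ends   <- marks[c(FALSE, TRUE)]      # 2nd, 4th, 6th, ... delimiter
  out <- character(length(starts))
  for (i in seq_along(starts)) {
    out[i] <- file.path(out_dir, sprintf("channel_%02d.uff", i))
    writeLines(lines[starts[i]:ends[i]], out[i])
  }
  out
}

# Small demonstration file with two shortened datasets.
demo <- tempfile(fileext = ".uff")
writeLines(c("    -1", "    58", "channel_01", "1.0 2.0", "    -1",
             "    -1", "    58", "channel_02", "3.0 4.0", "    -1"), demo)
files <- split_uff(demo)
basename(files)
```

Files are numbered in the order the datasets appear. Naming them after the channel label inside each block would need extra parsing of the header lines.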
I have 7909 observations in a data frame. I want to tabulate Engineer ID against a logical value (TRUE/FALSE) that indicates whether the engineer arrived on site early or late:
> head(table(f_SOMs$ENG.ID, f_SOMs$EARLY))
FALSE TRUE
4PXO18 0 0
5BZS21 0 0
5DAD37 0 0
6YJM50 6YGP51 0 0
7JVE00 0 0
7KAM01 0 0
The problem I have is it doesn't print the whole thing and gives this at the end of printing:
[ reached getOption("max.print") -- omitted 5061 rows ]
How can I get all the values? Is there an easier way?
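Two options that may help, sketched on a made-up stand-in for `f_SOMs` (the values are hypothetical): raise the `max.print` option, or convert the table to a data frame so it can be browsed or written to a file instead of printed.

```r
# Toy stand-in for f_SOMs (hypothetical values).
f_SOMs <- data.frame(
  ENG.ID = c("4PXO18", "4PXO18", "5BZS21", "7JVE00", "7JVE00", "7JVE00"),
  EARLY  = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

tab <- table(f_SOMs$ENG.ID, f_SOMs$EARLY)

# Option 1: raise the print limit (counted in entries, not rows), then restore it.
old <- options(max.print = 100000)
print(tab)
options(old)

# Option 2: convert the table to a data frame for View() or write.csv().
counts <- as.data.frame.matrix(tab)
counts
```

If many rows show 0 in both columns, as in the head() output, one possible cause is unused factor levels in ENG.ID; calling droplevels() on the column before table() would remove them.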