Creating a Dichotomous Variable in R - r

I have imported a csv file which contains 2044 observations of 3 variables, CASEID, DEGREE, and HRS1.
The first 6 observations appeared as:
head(degree.wrk)
  CASEID DEGREE HRS1
1  53044      3   55
2  53045      3   45
3  53046      0   -1
4  53047      0   -1
5  53048      0   -1
6  53049      0   -1
I want to create a dichotomous variable based on DEGREE that indicates whether a person has earned at least a bachelor's degree. According to the codebook, DEGREE values greater than or equal to 3 indicate that at least a bachelor's degree was earned. If that minimum has been met, I want the variable to be "Yes"; if not, "No". I used the ifelse() function and it appears to have worked, but I wonder whether replacing the numerical value of DEGREE with the "Yes" or "No" label is the correct way to create a dichotomous variable, or whether I have simply recoded an existing variable.
The results of the ifelse() function are as follows:
degree.wrk$DEGREE <- ifelse(degree.wrk$DEGREE >= 3, "Yes", "No")
head(degree.wrk)
  CASEID DEGREE HRS1
1  53044    Yes   55
2  53045    Yes   45
3  53046     No   -1
4  53047     No   -1
5  53048     No   -1
6  53049     No   -1
Any advice as to whether or not I adequately created a dichotomous variable using this method?
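One way to keep the original codes and still get a dichotomous variable is to write the result into a new column instead of overwriting DEGREE. The sketch below uses toy data that mirrors the six rows shown above. The column name BACHELORS is hypothetical. Whether the codebook reserves high DEGREE values for missing responses (which `>= 3` would then label "Yes") is not stated in the post and is worth checking.

```r
# Toy data mirroring the first six rows shown in the question.
degree.wrk <- data.frame(
  CASEID = 53044:53049,
  DEGREE = c(3, 3, 0, 0, 0, 0),
  HRS1   = c(55, 45, -1, -1, -1, -1)
)

# Keep DEGREE intact and store the dichotomy in a new column.
# BACHELORS is a hypothetical name chosen for this illustration.
degree.wrk$BACHELORS <- factor(
  ifelse(degree.wrk$DEGREE >= 3, "Yes", "No"),
  levels = c("No", "Yes")
)

# Cross-tabulate the original codes against the new variable to verify the recode.
check <- table(degree.wrk$DEGREE, degree.wrk$BACHELORS)
print(check)
```

Storing the result as a factor with explicit levels keeps "No" as the reference category in later models, and the cross-tabulation makes it easy to confirm that every original code landed in the intended category.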

Related

Regression with before and after

I have a dataset with four variables (df)
household group income post
1         0     20'000 0
1         0     22'000 1
2         1     10'000 0
2         1     20'000 1
3         0     20'000 0
3         0     21'000 1
4         1      9'000 0
4         1     16'000 1
5         1      8'000 0
5         1     18'000 1
6         0     22'000 0
6         0     26'000 1
7         1     12'000 0
7         1     24'000 1
8         0     24'000 0
8         0     27'000 1
group is a binary variable that equals 1 when the household received support from the state. post is also binary and equals 1 for the period after some households received state support.
Now I would like to run a before-vs-after regression that estimates the group effect by comparing the post period with the pre period for the supported group. I would like to put the dependent variable in logs so that the effect, i.e. the impact of state support on income, can be read as a percentage.
I used the following code, but I don't know whether it is the right way to get the answer:
library("fixest")
feols(log(income) ~ group + post,data=df) %>% etable()
Is there another way?
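If you are looking for the classic 2x2 design, your code was almost correct: replace '+' with '*' so that the model includes the group-by-post interaction. In the levels model, the interaction coefficient tells us that the income of the supported group increased by 7,250 more than the income of the group that did not receive support.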
comparing = feols(income ~ group * post, data = df)
comparing_log = feols(log(income) ~ group * post, data = df)
etable(comparing, comparing_log)
PS: Reading a coefficient from the log model directly as a percentage change is a good approximation only when the coefficient is small. The exact formula for the percentage change is exp(beta) - 1. In this case it is exp(0.5829) - 1 = 0.7912, so the change here is 79.12%.
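To see where the 7,250 comes from, the interaction coefficient can be compared with a difference of group means computed by hand. This is a minimal sketch that uses base R's lm() rather than feols() so that it runs without extra packages. The data frame is the table from the question typed in by hand, and pct_change is a hypothetical helper name.

```r
# Data from the question, entered by hand (two rows per household: post = 0, 1).
df <- data.frame(
  household = rep(1:8, each = 2),
  group     = rep(c(0, 1, 0, 1, 1, 0, 1, 0), each = 2),
  income    = c(20000, 22000, 10000, 20000, 20000, 21000, 9000, 16000,
                8000, 18000, 22000, 26000, 12000, 24000, 24000, 27000),
  post      = rep(c(0, 1), times = 8)
)

# 2x2 difference-in-differences: the group:post coefficient is the estimate.
fit <- lm(income ~ group * post, data = df)
did_coef <- unname(coef(fit)["group:post"])

# Same quantity from the four cell means (rows = group, columns = post).
means <- tapply(df$income, list(df$group, df$post), mean)
did_manual <- (means["1", "1"] - means["1", "0"]) -
              (means["0", "1"] - means["0", "0"])

# Exact percentage change implied by a coefficient from a log-outcome model.
pct_change <- function(beta) exp(beta) - 1

print(c(did_coef = did_coef, did_manual = did_manual))
```

In the saturated 2x2 model the regression and the hand calculation agree exactly, so the regression is mainly a convenient way to get standard errors for the same comparison.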

Matching two datasets using different IDs

I have two datasets: one is longitudinal (following individuals over multiple years) and one is cross-sectional. The cross-sectional dataset is compiled from the longitudinal dataset, but uses a randomly generated ID variable that does not allow tracking someone across years. I need the panel/longitudinal structure, but the cross-sectional dataset has more variables available than the longitudinal one.
The combination of ID and year uniquely identifies each observation, but since the ID values are not the same across the two datasets (they are randomized in the cross-sectional one so that individuals cannot be tracked), I cannot match on it.
I guess I would need to find a set of variables, excluding ID, that uniquely identifies each observation, and match on those. How would I go about doing that in R?
The long dataset looks like so
id year y
1 1 10
1 2 20
1 3 30
2 1 15
2 2 20
2 3 5
and the cross dataset like so
id year y x
912 1 10 1
492 2 20 1
363 3 30 0
789 1 15 1
134 2 25 0
267 3 5 0
Now, in actuality the data has 200-300 variables. So I would need a method to find the smallest set of variables that uniquely identifies each observation in the long dataset and then match based on these to the cross-sectional dataset.
Thanks in advance!
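A possible starting point is to test candidate column sets for uniqueness with anyDuplicated() and then merge on the first set that works. The sketch below is a brute-force search that tries the smallest sets first, so it is only practical for a small maximum key size. is_key and find_key are hypothetical helper names, and the extra variable z is invented for the toy data. It also assumes the shared variables hold identical values in both files.

```r
# TRUE when the given columns jointly identify every row of `data`.
is_key <- function(data, cols) {
  anyDuplicated(data[, cols, drop = FALSE]) == 0
}

# Try all combinations of size 1, 2, ... up to max_size; return the first key found.
find_key <- function(data, candidates, max_size = 3) {
  for (k in seq_len(min(max_size, length(candidates)))) {
    for (cols in combn(candidates, k, simplify = FALSE)) {
      if (is_key(data, cols)) return(cols)
    }
  }
  NULL  # no key of the allowed size exists
}

# Toy data: year and y follow the question; z is a made-up extra variable.
long <- data.frame(id = rep(1:2, each = 3), year = rep(1:3, 2),
                   y = c(10, 20, 30, 15, 20, 5), z = c(4, 7, 1, 4, 9, 1))
cross <- data.frame(id = c(912, 492, 363, 789, 134, 267), year = rep(1:3, 2),
                    y = c(10, 20, 30, 15, 20, 5), z = c(4, 7, 1, 4, 9, 1),
                    x = c(1, 1, 0, 1, 0, 0))

# Search among the shared variables, excluding the incompatible id.
key <- find_key(long, setdiff(names(long), "id"))

# The key must also be unique in the cross-sectional file before merging.
stopifnot(!is.null(key), is_key(cross, key))

# Attach the extra variable x to the panel, keeping the panel's own id.
matched <- merge(long, cross[, c(key, "x")], by = key)
print(matched)
```

With 200-300 variables the number of combinations grows quickly, so restricting the candidates to variables that take many distinct values should help before running the search.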

Reverse cumsum with breaks with non-sequential numbers

Looking to fill a matrix with a reverse cumsum. There are multiple breaks that must be maintained.
I have provided a sample matrix for what I want to accomplish. The first column is the data, the second column is what I want. You will see that column 2 is updated to reflect the number of items that are left. When there are 0's the previous number must be carried through.
update <- matrix(c(rep(0,4),rep(1,2),2,rep(0,2),1,3,
rep(10,4), 9,8,6, rep(6,2), 5, 2),ncol=2)
I have tried multiple ways to create a sequence and to loop, using numerous packages (e.g. zoo). What makes it difficult is that the numbers in column 1 can be any of 0, 1, ..., X, as long as they are less than column 2.
Any help or tips would be appreciated
EDIT: Column 2 starts with a given value, which can represent any starting value (e.g. inventory at the beginning of a month). Column 1 then represents the "purchases" made; thus, column 2 should reflect the total number of remaining items available.
The following will report the purchase and inventory balance as described:
starting_inventory <- 100
df <- data.frame(purchases=c(rep(0,4),rep(1,2),2,rep(0,2),1,3))
df$cum_purchases <- cumsum(df$purchases)
df$remaining_inventory <- starting_inventory - df$cum_purchases
Result:
   purchases cum_purchases remaining_inventory
1          0             0                 100
2          0             0                 100
3          0             0                 100
4          0             0                 100
5          1             1                  99
6          1             2                  98
7          2             4                  96
8          0             4                  96
9          0             4                  96
10         1             5                  95
11         3             8                  92
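As a check, the same approach with a starting value of 10 appears to reproduce column 2 of the sample matrix in the question. This is a small sketch using only the numbers from the question. The names purchases, wanted and remaining are chosen here for illustration.

```r
# Column 1 (purchases) and column 2 (desired result) from the question's matrix.
purchases <- c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3)
wanted    <- c(rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2)

# Starting value minus the running total of purchases.
starting_inventory <- 10
remaining <- starting_inventory - cumsum(purchases)

# Rows with zero purchases carry the previous balance forward automatically.
print(cbind(purchases, remaining, wanted))
```

Because cumsum() does not change on rows where purchases is 0, the previous balance is carried through those rows without any special handling.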

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
  Date   Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol       1       3    1       0       13
2 23Ap cital       1       5    3       1        6
3 23Ap gerol       0       3    0       0        9
4 23Ap   mix       0       5    0       0        8
5 23Ap cital       0       5    1       0       13
6 23Ap cella       0       5    0       1        4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
   Date Trt Treated Control Dead DeadinC AliveinC
4  23Ap mix       0       5    0       0        8
8  23Ap mix       0       5    1       0        8
10 23Ap mix       0       2    3       0        5
20 23Ap mix       0       0    0       0       18
25 23Ap mix       0       2    1       0       15
28 23Ap mix       0       1    0       0       12
For G.test(x) to work when x is a matrix, the matrix must have 2 columns of numbers, with 1 row per population. If I use the apply() function, I can run G.test on each row, provided my data set contains only two columns of numbers. I want to look only at the Treated and Control columns, for example, but I'm not sure how to omit the other columns so that G.test ignores them and the headers. I've tried the following, but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
The G.test spits out this, when you compare two numbers.
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question is: is there some way to add up all the df and G values that I get for each row (once I'm able to get all these numbers) within each treatment? Is there also some way to have R report the G, df and p-values in a table to be summed, rather than printing them as above for each row?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).
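On the question's error, "object 'G.test' not found" usually means the package was installed but not attached with library(RVAideMemoire). For the second part of the question (collecting G and df per row and summing them within each treatment), the bookkeeping could look like the sketch below. To stay self-contained it uses a hand-rolled G statistic against equal expected proportions, not RVAideMemoire's G.test, so its numbers need not match that function's output. g_row is a hypothetical helper, and the data are the six rows of head(bio).

```r
# Toy data copied from the head(bio) output in the question.
bio <- data.frame(
  Trt     = c("citol", "cital", "gerol", "mix", "cital", "cella"),
  Treated = c(1, 1, 0, 0, 0, 0),
  Control = c(3, 5, 3, 5, 5, 5)
)

# Hand-rolled G statistic for one row of counts, equal expected proportions.
# This is an illustration, NOT the package's G.test implementation.
g_row <- function(obs) {
  expected <- sum(obs) / length(obs)
  # Cells with a zero count contribute 0 to the sum.
  terms <- ifelse(obs > 0, obs * log(obs / expected), 0)
  2 * sum(terms)
}

# One G value and one degree of freedom per row (two categories per row).
per_row <- data.frame(
  Trt = bio$Trt,
  G   = apply(as.matrix(bio[, c("Treated", "Control")]), 1, g_row),
  df  = 1
)

# Sum G and df within each treatment, then look up a chi-squared p-value.
totals <- aggregate(cbind(G, df) ~ Trt, data = per_row, FUN = sum)
totals$p.value <- pchisq(totals$G, df = totals$df, lower.tail = FALSE)
print(totals)
```

Keeping the per-row results in a data frame means the summed G, df and p-value end up in one table per treatment instead of a printed test report for every row.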

how to add variable in longitudinal data using SPSS or R?

I have one file with repeated-measures data and another file with single observations for the same persons (e.g. in one file subjects have repeated assessments, and the other file just says whether subjects are male or female). When I merge the files I get something like this:
ID time gender
1 1 0
1 2
1 3
2 1 1
2 2
3 1 0
3 2
3 3
3 4
But I want the variable that was measured once (e.g. male/female) to be repeated across time (in each row) for each subject. So I would like to have:
1 1 0
1 2 0
1 3 0
2 1 1
2 2 1
and not do it manually, since I have thousands of cases...
How can I do this in SPSS (preferably) or in R?
You should have used match files with one "file" (multiple records per ID) and one "table" (no duplicate IDs).
But you can probably still fix it by running
sort cases by ID.
if mis(gender) and ID = lag(ID) gender= lag(gender).
Wherever there's no value for gender, it will be filled in with the gender of the previous case if it has the same ID as the current one.
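Since the question also allows R, here is a rough equivalent of both ideas: a one-to-many merge on ID, and a repair of a file that was already merged with gaps. The data frames long, persons and gappy are toy versions of the tables in the question, and their names are chosen here for illustration.

```r
# Repeated-measures file: several rows per ID.
long <- data.frame(ID   = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                   time = c(1, 2, 3, 1, 2, 1, 2, 3, 4))

# Single-observation file: one row per ID.
persons <- data.frame(ID = 1:3, gender = c(0, 1, 0))

# One-to-many merge: gender is repeated on every row of the same ID.
merged <- merge(long, persons, by = "ID")
merged <- merged[order(merged$ID, merged$time), ]

# Repair for a file that was already merged with gaps (gender only on one row).
gappy <- data.frame(ID     = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                    time   = c(1, 2, 3, 1, 2, 1, 2, 3, 4),
                    gender = c(0, NA, NA, 1, NA, 0, NA, NA, NA))

# Within each ID, spread the first non-missing gender to all rows.
gappy$gender <- ave(gappy$gender, gappy$ID,
                    FUN = function(g) g[!is.na(g)][1])

print(merged)
print(gappy)
```

The ave() version does not depend on row order within an ID, unlike a lag-based fill, which needs the row holding the value to come first.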
