Check if a variable is time invariant in R

I searched for an answer to my question, but I only found one for Stata (I am using R).
I am using a national survey to study which variables influence investment in a complementary pension (it is voluntary in my country).
The survey is conducted every two years and some individuals are interviewed more than once. I filtered the df with the filter command to keep only the individuals who appear more than once. This is an example from the original survey, already filtered:
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 1
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
2008 4 1972 F 33000 0
2010 4 1972 F 35000 0
where id is the individual, y.b is year of birth, pens is a dummy which takes value 1 if the individual invests in a complementary pension form.
I wanted to run a FE regression, so I loaded the plm package and then set up the df like this:
df.p <- plm.data(df, c("id", "year"))
After this command, I expected that constant variables were deleted but after running this regression:
pan1 <- plm(pens ~ woman + age + I(age^2) + high + medium + north + centre, model = "within", effect = "individual", data = df.p, na.action = na.omit)
(where woman is a variable which takes value 1 if the individual is a woman, high, medium refer to education level and north, centre to geographical regions) and after the command summary(pan1) the variable woman is still present.
At this point I think that there are some mistakes in the survey (for example, sex was not entered correctly and so is not the same across all rows of the same id), so I tried to find a way to check whether sex is constant for each id.
I tried this code but I am sure it is not correct:
df$x <- ifelse(df$id==df$id & df$sex==df$sex,1,0)
The basic idea should be like this:
df$x <- ifelse(df$id=="1" & df$sex=="F",1,0)
but I can't do it manually since the df has up to 40k observations.
If you know another way to check whether a variable is constant in R, I would be grateful.
Thank you in advance

I think what you are trying to do is calculate the number of unique values of sex for each id. You are hoping it is 1; any id with 2 or more indicates a data entry error. The way to do this in R is:
any(by(df$sex,df$id,function(x) length(unique(x))) > 1)
To break that down, the function length(unique(x)) tells you the number of different unique values in a vector. It's similar to levels for a factor (but not identical, since a factor can have levels not present).
The function by calculates the given function on each subset of df$sex according to df$id. In other words, it calculates length(unique(df$sex)) where df$id is 1, then 2, etc.
Lastly, any(... > 1) checks if any of the results are more than one. If they are, the result will be TRUE (and you can use which instead of any to find which ones). If everything is okay, the result will be FALSE.
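To list the offending ids rather than just detect that some exist, a minimal sketch along the same lines might look like the following. It swaps by for tapply (which returns a plain named vector that is easier to index), and the small data frame is made up for illustration, with id 1 deliberately given conflicting values.

```r
# Made-up example data: id 1 is recorded as both "F" and "M"
df <- data.frame(
  id  = c(1, 2, 1, 2, 3, 3, 4, 4),
  sex = c("F", "M", "M", "M", "M", "M", "F", "F")
)

# Number of distinct sex values per id (names are the id values)
n_sex <- tapply(df$sex, df$id, function(x) length(unique(x)))

# ids with more than one distinct value, i.e. not time invariant
bad_ids <- names(n_sex)[n_sex > 1]
bad_ids

# Inspect the affected rows
df[df$id %in% bad_ids, ]
```

The same pattern works for any variable that should be constant within id (for example y.b), by replacing df$sex.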

We can try with dplyr
Example data:
df=data.frame(year=c(2002,2002,2004,2004,2006,2008,2008,2010),
id=c(1,2,1,2,3,3,4,4),
sex=c("F","M","M","M","M","M","F","F"))
Here id 1 is recorded as both F and M.
library(dplyr)
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))
# A tibble: 4 x 2
id sexes
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 4 1
We can then filter for ids with more than one value:
df%>%group_by(id)%>%summarise(sexes=length(unique(sex)))%>%filter(sexes>1)
# A tibble: 1 x 2
id sexes
<dbl> <int>
1 1 2

Related

How to calculate percentages for categorical variables by items?

I have a question about calculating the percentage by items and time bins. The experiment is like this:
I conducted an eye-tracking experiment. Participants were asked to describe pictures consisting of two areas of interest (AOIs; I call them Agent and Patient). Their eye movements (fixations on the two AOIs) were recorded over time while they planned their formulation. I built a dataset containing the time information and the AOIs, as shown below (the whole time from picture onset was divided into separate time bins of 40 ms each).
Stimulus Participant AOIs time_bin
1 M1 agent 1
1 M1 patient 2
1 M1 patient 3
1 M1 agent 4
...
1 M2 agent 1
1 M2 agent 2
1 M2 agent 3
1 M2 patient 4
...
1 M3 agent 1
1 M3 agent 2
1 M3 agent 3
1 M3 patient 4
...
2 M1 agent 1
2 M1 agent 2
2 M1 patient 3
2 M1 patient 4
I would like to create a table containing the proportion of one AOI (e.g. agent) for each stimulus and each time bin. It would look like this:
Stimulus time_bin percentage
1 1 20%
1 2 40%
1 3 55%
1 4 60%
...
2 1 30%
2 2 35%
2 3 40%
2 4 45%
I calculate the percentage because I want to do a multilevel analysis (Growth Curve Analysis) investigating the relationship between the dependent variable, agent fixation proportion, and the independent variable, time_bin, with stimulus as a random effect.
I hope my question is understandable despite my limited English.
If you have an idea or a suggestion, that would be a great help!
Using the tidyverse package ecosystem you could try:
library(tidyverse)
df %>%
mutate(percentage = as.integer(AOIs == "agent") ) %>%
group_by(Stimulus, time_bin) %>%
summarise(percentage = mean(percentage))
Note that this will give you ratios in the [0, 1] interval. You still have to convert them to percentage values by multiplying by 100 and appending "%".
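As a rough sketch of that last conversion step, here is a base R version on a small made-up data set (one stimulus, three participants, two time bins). It uses aggregate instead of the dplyr pipeline only so that it runs without extra packages; the column name agent is an assumption introduced for the example.

```r
# Made-up fixation data in the same layout as the question
df <- data.frame(
  Stimulus    = c(1, 1, 1, 1, 1, 1),
  Participant = c("M1", "M2", "M3", "M1", "M2", "M3"),
  AOIs        = c("agent", "agent", "agent", "patient", "agent", "agent"),
  time_bin    = c(1, 1, 1, 2, 2, 2)
)

# Proportion of agent fixations per stimulus and time bin (values in [0, 1])
prop <- aggregate(agent ~ Stimulus + time_bin,
                  data = transform(df, agent = AOIs == "agent"),
                  FUN = mean)

# Formatted percentage label, for display only
prop$percentage <- paste0(round(100 * prop$agent), "%")
prop
```

For the growth curve model itself you would probably keep the numeric proportion column (agent here) and use the "%" strings only for reporting, since a character column cannot be used as a numeric response.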

Need to get total for Column in R

I have written the code up to this point, but I have a column called Score in the rscores tibble whose values I need to add up to get a total.
library(tidyverse)
responses <- read_csv("responses.csv")
qformats <- read_csv("qformats.csv")
scoring <- read_csv("scoring.csv")
rlong <- gather(responses,Question, Response, Q1:Q10)
rlong_16 <- filter(rlong, Id == 16)
rlong2 <- inner_join(rlong_16, qformats, by = "Question")
rscores <- inner_join(rlong2, scoring)
What line of code do I add next to get the total for this column? I have been scratching my head for hours. Any help is appreciated :)
> head(rscores)
# A tibble: 6 x 5
Id Question Response QFormat Score
<dbl> <chr> <chr> <chr> <dbl>
1 16 Q1 Slightly Disagree F 0
2 16 Q2 Definitely Agree R 0
3 16 Q3 Slightly Disagree R 1
4 16 Q4 Definitely Disagree R 1
5 16 Q5 Slightly Agree R 0
6 16 Q6 Slightly Agree R 0
colSums() is overkill if you just need the sum of one column, and it will give you an error if any other column in the tibble/data.frame/etc. is not convertible to numeric. In your case, there's at least one character (chr) column that can't be summed. Typically you'd use rowSums or colSums on a matrix as opposed to a data frame.
Just use the sum function on the one column: sum(rscores$Score). Best of luck.
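A short sketch of this, using a stand-in for rscores built from the six rows shown in the question (the real tibble comes from the CSV files, which are not available here). The na.rm = TRUE argument is an addition, useful only if some scores could be missing.

```r
# Stand-in for the rscores tibble, using the six rows printed by head()
rscores <- data.frame(
  Id       = 16,
  Question = paste0("Q", 1:6),
  Score    = c(0, 0, 1, 1, 0, 0)
)

# Total of the Score column; na.rm = TRUE skips missing scores, if any
total <- sum(rscores$Score, na.rm = TRUE)
total
```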

Using multiple weights for lm() or glm() in R

I want to fit a model in R within which I need to apply two weights at the same time. Let's say my model is glm(y ~ x1 + male + East_Germany) where male identifies a respondent's gender and East Germany is also a binary variable checking whether someone lives in East Germany.
Now let's say both females and East Germans are dramatically misrepresented in my data. Assuming that it's not due to a flawed data collection process, I would have to apply two weights. But can I really specify two weights simultaneously like this...? glm(y ~ x1 + male + East_Germany, weight=c("male_wgt","east_wgt"))
My thinking was that applying one weight to the overall data or the overall model already changes up the whole data structure, but I might be wrong. Let's give you an example:
y x1 male east male_weight east_weight
5 4 0 1 5 2
3 2 1 1 1 2
9 7 1 0 1 1
4 8 1 0 1 1
1 3 1 0 1 1
6 4 1 0 1 1
...where male==1 means "male" and male==0 is "female" and we assume that both genders should be represented equally (50%), and where east==1 means "East Germany", east==0 is "West Germany" and here too, let's assume for reasons of simplicity that both should be represented equally. y and x1 are just random numbers.
I'm wondering how I would apply both weights simultaneously, because if I say "let's count row#1 five times so that females gain more weight", I'm at the same time giving East Germany too much weight (without having even applied east_weight). The reason is that if we count row#1 five times, we will end up with a new East/West ratio of 6:4. Or am I mistaken?
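The 6:4 figure can be checked directly on the example table. The sketch below only reproduces that arithmetic (summing male_weight within each region); it is not a recommendation for how to combine the two weights.

```r
# The six example rows from the question
d <- data.frame(
  y           = c(5, 3, 9, 4, 1, 6),
  x1          = c(4, 2, 7, 8, 3, 4),
  male        = c(0, 1, 1, 1, 1, 1),
  east        = c(1, 1, 0, 0, 0, 0),
  male_weight = c(5, 1, 1, 1, 1, 1),
  east_weight = c(2, 2, 1, 1, 1, 1)
)

# Unweighted East/West counts: 2 East, 4 West
table(d$east)

# Effective counts per region after applying male_weight only
weighted_counts <- tapply(d$male_weight, d$east, sum)
weighted_counts
```

With male_weight alone, East (east == 1) goes from 2 rows to an effective 6 and West stays at 4, which is the 6:4 ratio described above.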

Working with repeated values in rows

I am working with a df of 46216 observations where the units are homes and people, and each home may have any number of members, like:
(example image not preserved)
and so on for almost 18000 other homes.
What I need to do is get the mean years of education for every home, for which I guess I will need a variable that holds the number of people in each home.
What I tried is:
num_peopl=by(df$person_number, df$home, max), so that for each home I take the highest person number, which is the total number of people who live there. But when I try to cbind this with the df I get:
"arguments imply differing number of rows: 46216, 17931"
It is as if it puts the number of persons in only one row and leaves the others empty.
How can I do this? Is there a function?
I think aggregate and a join may be what you're looking for. aggregate does the same thing that you did, but puts the result into a data frame, which I at least am more familiar with.
Then I used dplyr's left_join, joining on the home numbers:
library(tidyverse)
df<-data.frame(home_number = c(1,1,1,2,2,3),
person_number = c(1,2,3,1,2,1),
age = c(20,21,1,54,50,30),
sex = c("m","f","f","m","f","f"),
salary = c(1000,890,NA,900,500,1200),
years_education = c(12,10,0,8,7,14))
df2<-aggregate(df$person_number, by = list(df$home_number), max)
df_final<-df%>%
left_join(df2, by = c("home_number" = "Group.1"))
home_number person_number age sex salary years_education x
1 1 1 20 m 1000 12 3
2 1 2 21 f 890 10 3
3 1 3 1 f NA 0 3
4 2 1 54 m 900 8 2
5 2 2 50 f 500 7 2
6 3 1 30 f 1200 14 1
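As an alternative sketch, base R's ave computes a per-group statistic and repeats it on every row of the group, so no join is needed. This is a different technique from the aggregate/left_join approach above; the column names n_people and mean_education are made up for the example, and the data are a subset of the columns in the example data frame.

```r
# Same example homes as above, keeping only the columns needed here
df <- data.frame(
  home_number     = c(1, 1, 1, 2, 2, 3),
  person_number   = c(1, 2, 3, 1, 2, 1),
  years_education = c(12, 10, 0, 8, 7, 14)
)

# Number of people per home, repeated on every row of that home
df$n_people <- ave(df$person_number, df$home_number, FUN = max)

# Mean years of education per home, also repeated on every row
df$mean_education <- ave(df$years_education, df$home_number, FUN = mean)
df
```

Note that mean can be applied per home directly, so the head count is not strictly required to get the average. If years_education contains NA values, something like FUN = function(x) mean(x, na.rm = TRUE) would be needed instead.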

Assigning data frame column values probabilistically

I am trying to create a data frame named "students" with four variables: Gender, Year (Freshman, Sophomore, Junior, Senior), Age, and GPA. The idea is to have a data frame that illustrates the four levels of measurement: nominal, ordinal, interval, and ratio.
At this point it looks something like this:
ID Gender Year Age GPA
1 Male Sophomore 0 3.9
2 Male Junior 0 3.3
3 Female Junior 0 3.6
4 Male Freshman 0 3.1
5 Female Senior 0 2.9
I'm having a problem with Age. I would like Age to be assigned based on a probability. For example, if a student is a Freshman, I'd like Age to be assigned along something like the following lines:
Age Probability
14 .47
15 .48
16 .05
I have a function to do that set up like this:
1: Age <- function(df) {
2: for (i in 1:nrow(df)) {
3: if (df[i, 2] == "Freshman") {
4: df[i, 3] = 15
5: } else if {
6: continue through the years
7: }
8: }
9: }
My thinking is that I want to change the right side of the assignment in Line 4 to something that will assign the age probabilistically. That's what I cannot figure out how to do.
On a related note, if there's a better way to do it than what I'm considering, I'd be appreciative of hearing that.
And on a final note, I've Googled the web at large, queried the R forums on Reddit and Talk Stats, and searched the R tags on this site, all to no avail. I can't believe I'm the first person who's ever wanted to do something like this, so it occurs to me that maybe I'm phrasing the query wrong. If that's the case, any guidance there would also be appreciated.
Use the sample function like this:
sample(14:16, size=1, prob=c(0.47, 0.48, 0.05))
## [1] 14
sample(14:16, size=10, replace=TRUE, prob=c(0.47, 0.48, 0.05))
## [1] 14 14 15 14 15 16 15 15 15 15
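To avoid the row-by-row loop, one possible sketch is to draw all the ages for a given Year in a single vectorised sample call. The students data frame below is made up, and only the Freshman distribution comes from the question; the other years would need their own (here unspecified) probabilities handled the same way.

```r
set.seed(42)  # only so the example is reproducible

# Made-up students data; Age starts at 0 as in the question
students <- data.frame(
  ID     = 1:6,
  Gender = c("Male", "Male", "Female", "Male", "Female", "Female"),
  Year   = c("Sophomore", "Freshman", "Junior", "Freshman", "Senior", "Freshman"),
  Age    = 0,
  GPA    = c(3.9, 3.3, 3.6, 3.1, 2.9, 3.4)
)

# Rows that are Freshmen
is_fresh <- students$Year == "Freshman"

# One draw per Freshman, using the probabilities from the question
students$Age[is_fresh] <- sample(14:16, size = sum(is_fresh),
                                 replace = TRUE,
                                 prob = c(0.47, 0.48, 0.05))
students
```

Repeating the last two steps for each Year (or looping over a named list of age ranges and probabilities) fills in the remaining rows.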
