I have 1000 individuals with measurements of plasma HDL levels over the last 15 years. The number of measurements per individual ranges from 1 to 15, and the measurements were not taken on the same dates. I would like to compare HDL levels over time across individuals and rank those whose HDL levels are consistently the highest over time. There must be some sort of linear regression method for measuring this effect, but I am not acquainted with it. One concern is the minimum number of measurements per individual that I should require in order to include or exclude individuals in the analysis.
I have a data.frame with 4 columns: Individual_ID, HDL_level, Age (at the time of measurement) and Sex. This is a data.frame with a toy example:
Individual_ID HDL_level Age (years old) Sex
1 50 12.3 M
1 52.1 15.4 M
1 55.3 17.1 M
2 45 12.1 M
2 46.3 12.1 M
3 60 14.3 F
3 55 16.2 F
... ... ... ...
I would like to do 2 rankings using R: 1) male and female considered together, NO adjustment for sex; and 2) male and female considered together, adjusting for sex. In both of them, I would like to adjust for diet (e.g., whether or not they ingested food during the last 3 hours before the measurements were taken). Thank you in advance.


R: How to measure difference with both categorical and numeric features

I'm very new to data wrangling. And now I have this problem at hand:
So basically I have used tables of biochemical measurements (all numerical) of patients to perform cluster analysis, and by doing so I sorted them into 5 clusters.
Then I also have their clinical data/features, now I want to ask if any of these clinical features (a mix of numerical and categorical features) are significantly different from one cluster to another. So how can I go about this? What test shall I perform? Is there a good library I should be looking at?
To give you an idea about the "clinical data":
ClusterAssigned PatientID age sex stage FISH IGHV IgG ...
1 S134567 50 m 4 11q mutated scig
1 S234667 80 m 2 13q mutated 6.5
1 S135677 55 f 4 11q na scig
1 S356576 94 f 2 13q,t12 unmutated 5
1 S187978 59 m 4 11q mutated scig
4 S278967 80 f 2 17q unmutated 6.5
4 S123467 75 f 4 na unmutated 9.1
4 S234577 62 m 2 t12 mutated 9
So you see that ClusterAssigned is based on my cluster analysis. FISH, IGHV and IgG are categorical, and you can see that there are sometimes na values and that one person can sometimes have multiple entries, such as "13q,t12".
As a simplification, I could perhaps take just the cluster 1 and 4 patients, omit all rows with na values, and ask whether there is a difference in their age, sex, FISH, IGHV... Still, what method can I use to perform such a test in one go?
You can convert the categorical variables into dummy variables first and then perform a normal cluster analysis.
Things get more complicated if you have ordered categorical fields.
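As a minimal sketch of the dummy-variable step, base R's model.matrix can expand factor columns into 0/1 indicators. The toy data frame below is made up for illustration and only borrows column names from the question; rows with na values and multi-valued entries such as "13q,t12" would need handling first.

```r
# Toy clinical data (hypothetical values; column names follow the question)
clinical <- data.frame(
  age  = c(50, 80, 55, 94),
  sex  = factor(c("m", "m", "f", "f")),
  IGHV = factor(c("mutated", "mutated", "unmutated", "unmutated"))
)

# model.matrix() builds one 0/1 column per non-reference factor level;
# the first column is the intercept, which is dropped here
dummies <- model.matrix(~ age + sex + IGHV, data = clinical)[, -1]

dummies
```

Each factor contributes one column per level except its reference level, so sex becomes a single sexm indicator here.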

Check if a variable is time invariant in R

I tried to search for an answer to my question, but I only found answers for Stata (I am using R).
I am using a national survey to study which variables influence the investment in complementary pension (it is voluntary in my country).
The survey is conducted every two years and some individuals are interviewed more than once. I filtered the df with the filter command to keep only the individuals who appear more than once. This is an example from the original survey, already filtered:
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 1
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
2008 4 1972 F 33000 0
2010 4 1972 F 35000 0
where id is the individual, y.b is year of birth, pens is a dummy which takes value 1 if the individual invests in a complementary pension form.
I wanted to run a FE regression so I load the plm package and then I set the df like this:
df.p <- plm.data(df, c("id", "year"))
After this command, I expected that constant variables were deleted but after running this regression:
pan1 <- plm(pens ~ woman + age + I(age^2) + high + medium + north + centre, model = "within", effect = "individual", data = df.p, na.action = na.omit)
(where woman is a variable which takes value 1 if the individual is a woman, high, medium refer to education level and north, centre to geographical regions) and after the command summary(pan1) the variable woman is still present.
At this point I think there are some mistakes in the survey (for example, sex was not entered correctly and so is not the same across all rows for the same id), so I tried to find a way to check whether sex is constant for each id.
I tried this code but I am sure it is not correct:
df$x <- ifelse(df$id==df$id & df$sex==df$sex,1,0)
The basic idea should be like this:
df$x <- ifelse(df$id=="1" & df$sex=="F",1,0)
but I can't do it manually since the df has up to 40k observations.
If you know another way to check whether a variable is constant in R, I would be grateful.
Thank you in advance
I think what you are trying to do is calculate the number of unique values of sex for each id. You are hoping it is 1, but any cases of 2 indicate a transcription error. The way to do this in R is
any(by(df$sex,df$id,function(x) length(unique(x))) > 1)
To break that down, the function length(unique(x)) tells you the number of different unique values in a vector. It's similar to levels for a factor (but not identical, since a factor can have levels not present).
The function by calculates the given function on each subset of df$sex according to df$id. In other words, it calculates length(unique(df$sex)) where df$id is 1, then 2, etc.
Lastly, any(... > 1) checks if any of the results are more than one. If they are, the result will be TRUE (and you can use which instead of any to find which ones). If everything is okay, the result will be FALSE.
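Here is a small worked example of the same idea on made-up data (the toy df below is an assumption, not the survey). This sketch uses tapply rather than by, since it returns a plain named array that is easy to index:

```r
# Toy panel (hypothetical): id 1 is recorded as both F and M
df <- data.frame(
  id  = c(1, 1, 2, 2, 3, 3),
  sex = c("F", "M", "M", "M", "F", "F")
)

# Number of distinct sex values per id
n_sex <- tapply(df$sex, df$id, function(x) length(unique(x)))

# TRUE if at least one id has more than one recorded sex
has_conflict <- any(n_sex > 1)

# The ids with inconsistent sex, as character labels
bad_ids <- names(n_sex)[n_sex > 1]
```

With this toy data, has_conflict is TRUE and bad_ids contains only "1".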
We can try with dplyr
Example data:
Id 1 is both F and M
# A tibble: 4 x 2
id sexes
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 4 1
We can then filter:
# A tibble: 1 x 2
id sexes
<dbl> <int>
1 1 2
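The code behind the output above is not shown. A minimal sketch that would produce tables of this shape could use n_distinct, assuming example data in which id 1 appears as both F and M (the data frame below is made up for illustration):

```r
library(dplyr)

# Hypothetical example data: id 1 is recorded as both F and M
df <- data.frame(
  id  = c(1, 1, 2, 2, 3, 3, 4, 4),
  sex = c("F", "M", "M", "M", "M", "M", "F", "F")
)

# Count the distinct sex values recorded for each id
sex_counts <- df %>%
  group_by(id) %>%
  summarise(sexes = n_distinct(sex))

# Keep only the ids where sex is not constant
inconsistent <- sex_counts %>%
  filter(sexes > 1)
```

Any id left in inconsistent has more than one recorded sex and is worth checking in the raw survey.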

'Forward' cumulative sum in dplyr

When examining datasets from longitudinal studies, I commonly get results like this from a dplyr analysis chain from the raw data:
df = data.frame(n_sessions=c(1,2,3,4,5), n_people=c(59,89,30,23,4))
i.e. a count of how many participants have completed a certain number of assessments at this point in time.
Although it is useful to know how many people have completed exactly n sessions, we more often need to know how many have completed at least n sessions. As the table below shows, a standard cumulative sum isn't appropriate. What we want are the values in the n_total column, which is a sort of "forwards cumulative sum" of the values in the n_people column, i.e. the value in each row should be the sum of itself and all values beyond it, rather than the standard cumulative sum, which is the sum of all values up to and including itself:
n_sessions n_people n_total cumsum
1 59 205 59
2 89 146 148
3 30 57 178
4 23 27 201
5 4 4 205
Generating the cumulative sum is simple:
mutate(df, cumsum = cumsum(n_people))
What would be an expression for generating a "forwards cumulative sum" that could be incorporated in a dplyr analysis chain? I'm guessing that cumsum would need to be applied to n_people after sorting by n_sessions descending, but can't quite get my head around how to get the answer while preserving the original order of the data frame.
You can take a cumulative sum of the reversed vector, then reverse that result. The built-in rev function is helpful here:
mutate(df, rev_cumsum = rev(cumsum(rev(n_people))))
For example, on your data this returns:
n_sessions n_people rev_cumsum
1 1 59 205
2 2 89 146
3 3 30 57
4 4 23 27
5 5 4 4

Using weights in R to consider the inverse of sampling probability [duplicate]

This is similar but not equal to Using weights in R to consider the inverse of sampling probability.
I have a long data frame and this is a part of the real data:
age gender labour_situation industry_code FACT FACT_2....
35 M unemployed 15 1510
21 F inactive 00 651
FACT is a variable that means, for the first row, that a male unemployed individual of 35 years represents 1510 individuals of the population.
I need to obtain some tables to show relevant information like the % of employed and unemployed people, etc. In Stata there are some options like tab labour_situation [w=FACT] that shows the number of employed and unemployed people in the population while tab labour_situation shows the number of employed and unemployed people in the sample.
A partial solution could be to repeat the 1st row of the data frame 1510 times, then the 2nd row 651 times, and so on. From what I've found, one option is to run
longdata <- data[rep(1:nrow(data), data$FACT), ]
employment_table = with(longdata, addmargins(table(labour_situation, useNA = "ifany")))
The other thing I need to do is run a regression that takes into account that there was cluster sampling in the following way: the population was divided into regions. This creates a problem: an individual interviewed in one region represents a certain number of people, while an individual interviewed in another region represents a different number, and these numbers are not in proportion to the total population of each region, so some regions will be overrepresented and others underrepresented. In order to take this into account, each observation should be weighted by the inverse of its probability of being sampled.
The last paragraph means that the model can be estimated with the usual equations, BUT the variance-covariance matrix won't be valid unless I consider the inverse of the sampling probability.
In Stata it is possible to run a regression with reg y x1 x2 [pweight=n], which calculates the right variance-covariance matrix considering the inverse of the sampling probability. At the moment I have to use Stata for some parts of my work and R for others. I'd like to use just R.
You can do this by repeating the rownames:
df1 <- df[rep(row.names(df), df$FACT), 1:5]
> head(df1)
age gender labour_situation industry_code FACT
1 35 M unemployed 15 1510
1.1 35 M unemployed 15 1510
1.2 35 M unemployed 15 1510
1.3 35 M unemployed 15 1510
1.4 35 M unemployed 15 1510
1.5 35 M unemployed 15 1510
> tail(df1)
age gender labour_situation industry_code FACT
2.781 21 F inactive 0 787
2.782 21 F inactive 0 787
2.783 21 F inactive 0 787
2.784 21 F inactive 0 787
2.785 21 F inactive 0 787
2.786 21 F inactive 0 787
Here, 1:5 refers to the columns to keep. If you leave that bit blank, all columns will be returned.
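If only the weighted tables are needed, the rows do not have to be replicated at all. As a rough sketch, base R's xtabs can sum the FACT weights per category directly. The first two rows below come from the question; the two "employed" rows are made up for illustration.

```r
# First two rows follow the question; the "employed" rows are hypothetical
data <- data.frame(
  labour_situation = c("unemployed", "inactive", "employed", "employed"),
  FACT             = c(1510, 651, 2000, 839)
)

# Sum the expansion weights within each category (population counts)
weighted_counts <- xtabs(FACT ~ labour_situation, data = data)

# Convert the weighted counts to percentages of the population
weighted_pct <- 100 * prop.table(weighted_counts)

weighted_counts
weighted_pct
```

This avoids building a very long data frame when the weights are large. It only covers the tabulation part; for the regression with probability weights, the survey package (svydesign and svyglm) is the usual place to look, which is not shown here.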

calculate gender percentage from grouped data frame in R

I have a fairly large data frame that includes information on individuals divided into treatment groups. I am trying to generate variable means and gender percentages per group. I was able to calculate the means but I am not sure how to get the gender percentages.
Below, I generated a small replica of what my data looks like:
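#create variables and data frame
sampleid <- 1:100 # assumed: consistent with the output below, but the original definition was not shown
gender <- rep(c("female","male"), c(50,50))
age <- sample(25:35, 100, replace = TRUE) # assumed: the original definition of age was not shown
score <- rnorm(100)
treatment <- rep(1:5, each = 4) # length 20, recycled to 100 rows by data.frame()
d <- data.frame(sampleid, gender, age, score, treatment)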
sampleid gender age score treatment
1 1 female 34 1.6917201 1
2 2 female 26 -1.6189545 1
3 3 female 28 1.2867895 1
4 4 female 34 -0.5027578 1
5 5 female 29 -1.3652895 2
6 6 female 26 -2.4430843 2
I obtain the mean of each numeric column by:
library(plyr) # provides ddply() and numcolwise()
groupstat <- ddply(d, .(treatment), numcolwise(mean))
which gives:
treatment sampleid age score
1 1 42.5 29.15 0.142078574
2 2 46.5 29.50 -0.261492514
3 3 50.5 30.50 -0.188393235
4 4 54.5 30.45 0.003526078
5 5 58.5 30.55 0.062996737
However I also need an additional column "Percent Female", which should give me the percentage of females within each treatment group 1:5.
Can someone help me add this?
Try this out
groupstat<-ddply(d, .(treatment),summarise,
meansc= mean(score),
meanage= mean(age),
meanID= mean(sampleid),
nfem= length(gender[gender=="female"]), # number females per treatment group
nmale= length(gender[gender=="male"]), # number of males per treatment group
percentfem= nfem/(nfem+nmale)) # percent females by treatment group
I would first split into treatment groups (split(d, f = d$treatment)) and then calculate the proportion of females in each group (function(x) sum(x$gender == "female")/length(x$gender)):
sapply(split(d, f = d$treatment), function(x) sum(x$gender == "female")/length(x$gender))
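Both answers above return a proportion between 0 and 1. If an actual percentage is wanted, one possible variant is to take the group mean of a logical vector and multiply by 100. This is a sketch on data built like the question's example, keeping only the two columns that matter here:

```r
# Same layout as the question: 50 females then 50 males,
# with treatment cycling through 1..5 in blocks of 4
d <- data.frame(
  gender    = rep(c("female", "male"), c(50, 50)),
  treatment = rep(1:5, each = 4, length.out = 100)
)

# The mean of a logical vector is the share of TRUE values;
# tapply() computes it within each treatment group
pct_female <- 100 * tapply(d$gender == "female", d$treatment, mean)

pct_female
```

The result is a named vector with one percentage per treatment group, which can be added to groupstat as a new column if the group order matches.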
