When working on a hierarchical/multilevel/panel dataset, it may be very useful to adopt a package which returns the within- and between-group standard deviations of the available variables.
This is something that with the following data in Stata can be easily done through the command
xtsum, i(momid)
I made a research, but I cannot find any R package which can do that..
edit:
Just to fix ideas, an example of hierarchical dataset could be this:
son_id mom_id hispanic mom_smoke son_birthweigth
1 1 1 1 3950
2 1 1 0 3890
3 1 1 0 3990
1 2 0 1 4200
2 2 0 1 4120
1 3 0 0 2975
2 3 0 1 2980
The "multilevel" structure is given by the fact that each mother (higher level) has two or more sons (lower level). Hence, each mother defines a group of observations.
Accordingly, each dataset variable can vary either between and within mothers or only between mothers. birtweigth varies among mothers, but also within the same mother. Instead, hispanic is fixed for the same mother.
For example, the within-mother variance of son_birthweigth is:
# mom1 means
bwt_mean1 <- (3950+3890+3990)/3
bwt_mean2 <- (4200+4120)/2
bwt_mean3 <- (2975+2980)/2
# Within-mother variance for birthweigth
((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 +
(4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 +
(2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)
While the between-mother variance is:
# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
mean <- (3950+3890+3990+4200+4120+2975+2980)/7
# within variance:
((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)
I don't know what your stata command should reproduce, but to answer the second part of question about
hierarchical structure , it is easy to do this with list.
For example, you define a structure like this:
tree = list(
"var1" = list(
"panel" = list(type ='p',mean = 1,sd=0)
,"cluster" = list(type = 'c',value = c(5,8,10)))
,"var2" = list(
"panel" = list(type ='p',mean = 2,sd=0.5)
,"cluster" = list(type="c",value =c(1,2)))
)
To create this lapply is convinent to work with list
tree <- lapply(list('var1','var2'),function(x){
ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
cluster= list(type = 'c',value = rnorm(3))) ## R prefer symbols
})
names(tree) <-c('var1','var2')
You can view he structure with str
str(tree)
List of 2
$ var1:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num 0.284
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
$ var2:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num -0.144
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] -0.595 -1.795 -0.439
Edit after OP clarification
I think that package reshape2 is what you want. I will demonstrate this here.
The idea here is in order to do the multilevel analysis we need to reshape the data.
First to divide the variables into two groups :identifier and measured variables.
library(reshape2)
dat.m <- melt(dat,id.vars=c('son_id','mom_id')) ## other columns are measured
str(dat.m)
'data.frame': 21 obs. of 4 variables:
$ son_id : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
$ mom_id : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
$ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
$ value : num 1 1 1 0 0 0 0 1 0 0 ..
Once your have data in "moten" form , you can "cast" to rearrange it in the shape that you want:
# mom1 means for all variable
acast(dat.m,variable~mom_id,mean)
1 2 3
hispanic 1.0000000 0 0.0
mom_smoke 0.3333333 1 0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother variance for birthweigth
acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
1 2 3
hispanic 0.0000000 0 0.0
mom_smoke 0.6666667 0 0.5
son_birthweigth 5066.6666667 3200 12.5
## overall mean of each variable
acast(dat.m,variable~.,mean)
[,1]
hispanic 0.4285714
mom_smoke 0.5714286
son_birthweigth 3729.2857143
I know this question is four years old, but recently I wanted to do the same in R and came up with the following function. It depends on dplyr and tibble. Where: df is the dataframe, columns is a numerical vector to subset the dataframe and individuals is the column with the individuals.
xtsumR<-function(df,columns,individuals){
df<-dplyr::arrange_(df,individuals)
panel<-tibble::tibble()
for (i in columns){
v<-df %>% dplyr::group_by_() %>%
dplyr::summarize_(
mean=mean(df[[i]]),
sd=sd(df[[i]]),
min=min(df[[i]]),
max=max(df[[i]])
)
v<-tibble::add_column(v,variacao="overal",.before=-1)
v2<-aggregate(df[[i]],list(df[[individuals]]),"mean")[[2]]
sdB<-sd(v2)
varW<-df[[i]]-rep(v2,each=12) #
varW<-varW+mean(df[[i]])
sdW<-sd(varW)
minB<-min(v2)
maxB<-max(v2)
minW<-min(varW)
maxW<-max(varW)
v<-rbind(v,c("between",NA,sdB,minB,maxB),c("within",NA,sdW,minW,maxW))
panel<-rbind(panel,v)
}
var<-rep(names(df)[columns])
n1<-rep(NA,length(columns))
n2<-rep(NA,length(columns))
var<-c(rbind(var,n1,n1))
panel$var<-var
panel<-panel[c(6,1:5)]
names(panel)<-c("variable","variation","mean","standard.deviation","min","max")
panel[3:6]<-as.numeric(unlist(panel[3:6]))
panel[3:6]<-round(unlist(panel[3:6]),2)
return(panel)
}
Related
*I have a large data set including 2000 variables, including factors and continuous variables.
For example:
library(finalfit)
library(dplyr)
data(colon_s)
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = "perfor.factor"
I use the following function to compare the mean of each continuous variable among the level of the categorical dependent variable (ANOVA) or the percentage of each categorical variable among the level of the categorical dependent variable (CHI-SQUARE)
summary_factorlist(colon_s, dependent ="perfor.factor", explanatory =explanatory , add_dependent_label=T, p=T,p_cat="fisher", p_cont_para = "aov", fit_id
= T)
But as soon as running the above code, I got the following error:
Error in dplyr::summarise():
! Problem while computing ..1 = ...$p.value.
Caused by error in fisher.test():
! 'x' and 'y' must have at least 2 levels
*In the data set, there are some variables which do not include at least two levels or just one of their levels has a non-zero frequency. I was wondering if there is any loop function to remove the variable if one of these conditions satisfies.
If the variable includes just one level
If the variable includes more than one level but the frequency of just one level is no-zero.
if all values of the variable are missing*
Update (partial answer):
With this code we can remove factors with only one level and keep other non factor variables:
x <- colon_s[, (sapply(colon_s, nlevels)>1) | (sapply(colon_s, is.factor)==FALSE)]
The OP's code does work with the data provided
library(dplyr)
library(finalfit)
summary_factorlist(colon_s, dependent ="perfor.factor",
explanatory =explanatory ,
add_dependent_label=TRUE, p=TRUE,p_cat="fisher", p_cont_para = "aov", fit_id = TRUE)
Dependent: Perforation No Yes p fit_id index
Age (years) Mean (SD) 59.8 (11.9) 58.4 (13.3) 0.542 age 1
Age <40 years 68 (7.5) 2 (7.4) 1.000 age.factor<40 years 2
40-59 years 334 (37.0) 10 (37.0) age.factor40-59 years 3
60+ years 500 (55.4) 15 (55.6) age.factor60+ years 4
Sex Female 432 (47.9) 13 (48.1) 1.000 sex.factorFemale 5
Male 470 (52.1) 14 (51.9) sex.factorMale 6
Obstruction No 715 (81.2) 17 (63.0) 0.026 obstruct.factorNo 7
Yes 166 (18.8) 10 (37.0) obstruct.factorYes 8
The strcture of data shows the factor variables to have more than 1 level
> str(colon_s[c(explanatory, dependent)])
'data.frame': 929 obs. of 5 variables:
$ age : num 43 63 71 66 69 57 77 54 46 68 ...
..- attr(*, "label")= chr "Age (years)"
$ age.factor : Factor w/ 3 levels "<40 years","40-59 years",..: 2 3 3 3 3 2 3 2 2 3 ...
..- attr(*, "label")= chr "Age"
$ sex.factor : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 2 2 2 1 ...
..- attr(*, "label")= chr "Sex"
$ obstruct.factor: Factor w/ 2 levels "No","Yes": NA 1 1 2 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Obstruction"
$ perfor.factor : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Perforation"
Regarding selection of factor variables with the condition mentioned, we could use
library(dplyr)
colon_s_sub <- colon_s %>%
select(where(~ is.factor(.x) && nlevels(.x) > 1 && all(table(.x) > 0) &
sum(complete.cases(.x)) > 0))
I have used the mice package for calculating imputations (10 iterations, 5 imputations). Because I am new to this area, my 'methodologist' - who is very patient with me! - want to judge the imputed values (so not the completed sets). I can't seem to find a way to collect all imputed values in one clear dataframe.
The data is about youngsters who answered a lot of questions in a 5 point Likert scale. I have several imp per age category. For example:
With my command imp_val_15_plus <- Filter(Negate(is.null), imp_15plus$imp)I can see all imputed values per question and per id. So for example imp_val_15plus[1:2] gives:
$X02_07
1 2 3 4 5
qwertyuiop123456789 4 4 4 4 4
$X02_12
1 2 3 4 5
adfghjkl09823430233 2 2 5 2 2
zcvnmoi987412597800 1 2 1 1 2
So here are two questions (X02_07 and X02_12). The first has one NA (id qwe...789) and the latter has two NA's (id adf...0233 and zcvn...7800)
I would like to have a dataframe like this:
q_nr id 1 2 3 4 5
$X02_07 qwertyuiop123456789 4 4 4 4 4
$X02_12 adfghjkl09823430233 2 2 5 2 2
$X02_12 zcvnmoi987412597800 1 2 1 1 2
So I thought of a way to extract the values I need and then try to use a loop for all these values. I tried to extract the values:
names(imp_val_15plus[1]) gives me the question number [1] "X02_07"
row.names(imp_val_15plus[[1]]) gives me the id number [1] "qwertyuiop123456789"
But then I go wrong with the imputed values.
With as.integer(imp_val_15plus[[1]]) I get [1] 3 3 3 3 3 instead of what I wanted [1] 4 4 4 4 4. The three's are logic, because of the factors available for question $X02_07. Normally there should be the factorlevels 1 - 5, but none of the youngsters used a 1, so my levels for this question are 2 - 5.
Have a look at str(imp_val_15plus[[1]]), it gives:
'data.frame': 1 obs. of 5 variables:
$ 1: Factor w/ 4 levels "2","3","4","5": 3
..- attr(*, "contrasts")= num [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr "2" "3" "4" "5"
.. .. ..$ : chr "2" "3" "4"
$ 2: Factor w/ 4 levels "2","3","4","5": 3
..- attr(*, "contrasts")= num [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr "2" "3" "4" "5"
.. .. ..$ : chr "2" "3" "4"
etc., etc.
It makes sense that I got the three's, because this is the number of the factor with the levels "2","3","4","5". How do I obtain the value of the value itself (the 4) instead of the 3? Or is there an other way to present all imputed values (not the completed set!!) in a neat way?
EDIT: This is not a course with a live instructor and I cannot ask him directly in any way. If it were, I wouldn't be wasting your time here.
I am taking an R class that is dealing with the basics of machine learning. We are working with the Vanderbilt Titanic dataset available HERE. The goal is the use the R mice package to imput missing age values. I've already split my data into train and test samples, and str(training) outputs:
'data.frame': 917 obs. of 14 variables:
$ pclass : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ survived : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 2 2 2 ...
$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)" ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 1 2 1 2 1 1 1 ...
$ age : num 29 0.92 2 25 48 63 71 18 24 26 ...
$ sibsp : int 0 1 1 1 0 1 0 1 0 0 ...
$ parch : int 0 2 2 2 0 0 0 0 0 0 ...
$ ticket : chr "24160" "113781" "113781" "113781" ...
$ fare : num 211.3 151.6 151.6 151.6 26.6 ...
$ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...
$ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 2 2 2 4 ...
$ boat : chr "2" "11" "" "" ...
$ body : int NA NA NA NA NA NA 22 NA NA NA ...
$ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
The instructor then goes on to write:
factor_vars <- c('pclass', 'sex', 'embarked', 'survived')
training[factor_vars] <- lapply(training[factor_vars], function(x) as.factor(x))
impute_variables <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked')
mice_model <- mice(training[,impute_variables], method='rf')
mice_output <- complete(mice_model)
mice_output
I understand the factor_vars piece - these variables are labelled as factors in the structure output. What I don't understand is how the impute_variables were chosen or what they mean exactly. Are they arbitrarily chosen, perhaps on the basis that the instructor believed things like 'pclass' (which is the indicator for steerage, coach, or first class) may help predict age (with older people being able to afford first class perhaps) while things like 'cabin' would have no relevance?
Furthermore, in the line mice_model <- mice(training[,impute_variables], method='rf'), which portion of the function is declaring that we want to be imputing the age of the passengers?
I'm using the aggregate function to summarise some data. The data is loans data, I have the ContractNum and LoanAmount. I want to aggregate the data by StartDate, count the number of Loans and Average the loan amount.
Here is a sample of the data and the function that I use:
ContractNum <- c("RHL-1","RHL-2","RHL-3","RHL-3")
StartDate <- c("2016-11-01","2016-11-01","2016-12-01","2016-12-01")
LoanPurpose <- c("Personal","Personal","HomeLoan","Investment")
LoanAmount <- c(200,500,600,150)
dat <- data.frame(ContractNum,StartDate,LoanPurpose,LoanAmount)
aggr.data <- aggregate(
cbind(LoanAmount,ContractNum) ~ StartDate + LoanPurpose
,data = dat
,FUN = function(x)c(count = mean(x),length(x))
)
When I lookat the results of the aggregate function, it looks ok:
> aggr.data
StartDate LoanPurpose LoanAmount.count LoanAmount.V2 ContractNum.count ContractNum.V2
1 2016-12-01 HomeLoan 600 1 3.0 1.0
2 2016-12-01 Investment 150 1 3.0 1.0
3 2016-11-01 Personal 350 2 1.5 2.0
But when I look at the strucutre of it, it seems to have created a sub-list:
> str(aggr.data)
'data.frame': 3 obs. of 4 variables:
$ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
$ LoanPurpose: Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
$ LoanAmount : num [1:3, 1:2] 600 150 350 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
$ ContractNum: num [1:3, 1:2] 3 3 1.5 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
How do I get rid of this sub-list so that I can access each column the way I would normally access a DF? I understand that in the code I've asked to give me a mean on a ContractNum which is not meaningful, but I can just get rid of that column.
Thank you
Just do do.call(data.frame, ...) on aggr.data to unnest the matrices.
aggr.data <- do.call(data.frame, aggr.data);
str(aggr.data);
#'data.frame': 3 obs. of 6 variables:
# $ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
# $ LoanPurpose : Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
# $ LoanAmount.count : num 600 150 350
# $ LoanAmount.V2 : num 1 1 2
# $ ContractNum.count: num 3 3 1.5
# $ ContractNum.V2 : num 1 1 2
I have a data.frame mydf, that contains data from 27 subjects. There are two predictors, congruent (2 levels) and offset (5 levels), so overall there are 10 conditions. Each of the 27 subjects was tested 20 times under each condition, resulting in a total of 10*27*20 = 5400 observations. RT is the response variable. The structure looks like this:
> str(mydf)
'data.frame': 5400 obs. of 4 variables:
$ subject : Factor w/ 27 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
$ congruent: logi TRUE FALSE FALSE TRUE FALSE TRUE ...
$ offset : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 5 1 2 5 5 2 2 3 5 ...
$ RT : int 330 343 457 436 302 311 595 330 338 374 ...
I've used daply() to calculate the mean RT of each subject in each of the 10 conditions:
myarray <- daply(mydf, .(subject, congruent, offset), summarize, mean = mean(RT))
The result looks just the way I wanted, i.e. a 3d-array; so to speak 5 tables (one for each offset condition) that show the mean of each subject in the congruent=FALSE vs. the congruent=TRUE condition.
However if I check the structure of myarray, I get a confusing output:
List of 270
$ : num 417
$ : num 393
$ : num 364
$ : num 399
$ : num 374
...
# and so on
...
[list output truncated]
- attr(*, "dim")= int [1:3] 27 2 5
- attr(*, "dimnames")=List of 3
..$ subject : chr [1:27] "1" "2" "3" "5" ...
..$ congruent: chr [1:2] "FALSE" "TRUE"
..$ offset : chr [1:5] "1" "2" "3" "4" ...
This looks totally different from the structure of the prototypical ozone array from the plyr package, even though it's a very similar format (3 dimensions, only numerical values).
I want to compute some further summarizing information on this array, by means of aaply. Precisely, I want to calculate the difference between the congruent and the incongruent means for each subject and offset.
However, already the most basic application of aaply() like aaply(myarray,2,mean) returns non-sense output:
FALSE TRUE
NA NA
Warning messages:
1: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
2: In mean.default(piece, ...) :
argument is not numeric or logical: returning NA
I have no idea, why the daply() function returns such weirdly structured output and thereby prevents any further use of aaply. Any kind of help is kindly appreciated, I frankly admit that I have hardly any experience with the plyr package.
Since you haven't included your data it's hard to know for sure, but I tried to make a dummy set off your str(). You can do what you want (I'm guessing) with two uses of ddply. First the means, then the difference of the means.
#Make dummy data
mydf <- data.frame(subject = rep(1:5, each = 150),
congruent = rep(c(TRUE, FALSE), each = 75),
offset = rep(1:5, each = 15), RT = sample(300:500, 750, replace = T))
#Make means
mydf.mean <- ddply(mydf, .(subject, congruent, offset), summarise, mean.RT = mean(RT))
#Calculate difference between congruent and incongruent
mydf.diff <- ddply(mydf.mean, .(subject, offset), summarise, diff.mean = diff(mean.RT))
head(mydf.diff)
# subject offset diff.mean
# 1 1 1 39.133333
# 2 1 2 9.200000
# 3 1 3 20.933333
# 4 1 4 -1.533333
# 5 1 5 -34.266667
# 6 2 1 -2.800000