how to do the cluster wise summary statistics dynamically

how to do the cluster wise summary statistics dynamically - r

Suppose i have this data with cluster mapped to each row as below:
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
To get mean cluster wise i do:
a<-as.data.frame(aggregate( .~ cluster, FUN=mean, data=movies.imp))
a
## cluster num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 1 316561.46 831.60526 326.61773
## 2 2 44934.26 180.05922 109.69265
## 3 3 29020.10 80.20408 28.57143
## movie_facebook_likes cast_total_facebook_likes
## 1 33647.263 28282.450
## 2 3119.099 6641.746
## 3 6843.327 2426.755
Then i convert to long:
library(tidyr)
long_df.a <- gather(data=a, value = mean, key=variablenames, num_voted_users,num_user_for_reviews,num_critic_for_reviews,movie_facebook_likes,cast_total_facebook_likes)
long_df.a
long_df.a[,1]<-as.factor(long_df.a[,1]) # converting into a factor
long_df.a[,2]<-as.factor(long_df.a[,2])# converting into a factor
Then i do the same process to get median,min,max & std clusterwise
creating one dataframe of all the descriptives stats calc above
dflong<-cbind(long_df.a,long_df.b[,3],long_df.c[,3],long_df.d[,3],long_df.e[,3])
dflong<-dflong%>% set_names(c("cluster","variablenames","mean","median","min","max","sd")) # renaming the columns
head(dflong)
## cluster variablenames mean median min max
## 1 1 num_voted_users 316561.45706 263332.5 246 1689764
## 2 2 num_voted_users 44934.26451 25256.5 5 469561
## 3 3 num_voted_users 29020.10204 9277.0 15 213483
## 4 1 num_user_for_reviews 831.60526 642.0 1 5060
## 5 2 num_user_for_reviews 180.05922 129.0 1 1690
## 6 3 num_user_for_reviews 80.20408 42.0 1 394
## sd
## 1 231350.9509
## 2 53613.7994
## 3 48491.2638
## 4 659.0410
## 5 181.3630
## 6 105.1669
subsetting the data clusterwise
dflong.1<-dflong %>% filter(cluster==1)
dflong.2<-dflong %>% filter(cluster==2)
dflong.3<-dflong %>% filter(cluster==3)
dflong.combined<-rbind(dflong.1,dflong.2,dflong.3)
head(dflong.combined) # final required output
## cluster variablenames mean median min max
## 1 1 num_voted_users 316561.45706 263332.5 246 1689764
## 2 1 num_user_for_reviews 831.60526 642.0 1 5060
## 3 1 num_critic_for_reviews 326.61773 307.5 2 813
## 4 1 movie_facebook_likes 33647.26316 23000.0 0 349000
## 5 1 cast_total_facebook_likes 28282.45014 21095.0 44 656730
## 6 2 num_voted_users 44934.26451 25256.5 5 469561
## sd
## 1 231350.95086
## 2 659.04103
## 3 142.47953
## 4 37698.06583
## 5 37395.59205
## 6 53613.79942
So i am doing things in a non optimal way to get clusterwise summary stats....need help on how to sort of dynamically use loops or apply functions to get the final output in lesser lines of codes.....

I use the mtcars dataset as an example. Suppose the cyl variable is our equivalent of cluster. You can get all your summary statistics in one line of code:
d <- mtcars
s <- d %>% group_by(cyl) %>%
summarise_all(c("mean", "median", "min", "max", "sd"))
# cyl mpg_mean disp_mean hp_mean drat_mean wt_mean qsec_mean vs_mean
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 26.66364 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909
# 2 6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
# 3 8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
# # ... with 43 more variables: ...
It just remains for us to reshape the dataframe to get it into our desired form:
s <- gather(s, key, value, -cyl)
s <- separate(s, key, c("variable", "stat"))
d.combined <- spread(s, key = stat, value = value)
# # A tibble: 30 × 7
# cyl variable max mean median min sd
# * <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 am 1.00 0.7272727 1.00 0.000 0.4670994
# 2 4 carb 2.00 1.5454545 2.00 1.000 0.5222330
# 3 4 disp 146.70 105.1363636 108.00 71.100 26.8715937
# 4 4 drat 4.93 4.0709091 4.08 3.690 0.3654711
# 5 4 gear 5.00 4.0909091 4.00 3.000 0.5393599
# 6 4 hp 113.00 82.6363636 91.00 52.000 20.9345300
# 7 4 mpg 33.90 26.6636364 26.00 21.400 4.5098277
# 8 4 qsec 22.90 19.1372727 18.90 16.700 1.6824452
# 9 4 vs 1.00 0.9090909 1.00 0.000 0.3015113
# 10 4 wt 3.19 2.2857273 2.20 1.513 0.5695637
# # ... with 20 more rows

Related

How to mutate the slopes of lines

I have a question on how to mutate the slopes of lines into a new data frame into
by category.
d1 <-read.csv(file.choose(), header = T)
d2 <- d1 %>%
group_by(ID)%>%
mutate(Slope=sapply(split(df,df$ID), function(v) lm(x~y,v)$coefficients["y"]))
ID x y
1 3.429865279 2.431363764
1 3.595066124 2.681241237
1 3.735263469 2.352182518
1 3.316473584 2.51851394
1 3.285984642 2.380211242
1 3.860793029 2.62324929
1 3.397714117 2.819543936
1 3.452997088 2.176091259
1 3.718933278 2.556302501
1 3.518566578 2.537819095
1 3.689033452 2.40654018
1 3.349160923 2.113943352
1 3.658888644 2.556302501
1 3.251151343 2.342422681
1 3.911194909 2.439332694
1 3.432584505 2.079181246
1 4.031267043 2.681241237
1 3.168733129 1.544068044
1 4.032239897 3.084576278
1 3.663361648 2.255272505
1 3.582302046 2.62324929
1 3.606585565 2.079181246
1 3.541791347 2.176091259
4 3.844012861 2.892094603
4 3.608318477 2.767155866
4 3.588990218 2.883661435
4 3.607957917 2.653212514
4 3.306753044 2.079181246
4 4.002604841 2.880813592
4 3.195299837 2.079181246
4 3.512203238 2.643452676
4 3.66878494 2.431363764
4 3.598910385 2.511883361
4 3.721810134 2.819543936
4 3.352964661 2.113943352
4 4.008109343 3.084576278
4 3.584693332 2.556302501
4 4.019461819 3.084576278
4 3.359474563 2.079181246
4 3.950256012 2.829303773
I got the error message like'replacement has 2 rows, data has 119'. I am sure that the error is derived from mutate().
Best,

Once you do group_by, any function that succeeds uses on the columns in the grouped data.frame, in your case, it will only use x,y column within.
If you only want the coefficient, it goes like this:
df %>% group_by(ID) %>% summarize(coef=lm(x~y)$coefficients["y"])
# A tibble: 2 x 2
ID coef
<int> <dbl>
1 1 0.437
2 4 0.660
If you want the coefficient, which means a vector a long as the dataframe, you use mutate:
df %>% group_by(ID) %>% mutate(coef=lm(x~y)$coefficients["y"])
# A tibble: 40 x 4
# Groups: ID [2]
ID x y coef
<int> <dbl> <dbl> <dbl>
1 1 3.43 2.43 0.437
2 1 3.60 2.68 0.437
3 1 3.74 2.35 0.437
4 1 3.32 2.52 0.437
5 1 3.29 2.38 0.437
6 1 3.86 2.62 0.437
7 1 3.40 2.82 0.437
8 1 3.45 2.18 0.437
9 1 3.72 2.56 0.437
10 1 3.52 2.54 0.437
# … with 30 more rows

How to make grouped summary statistics based off of densities in R

Goal: I would like to generate grouped percentiles for each group (hrzn)
I have the following data
# A tibble: 3,500 x 3
hrzn parameter density
<dbl> <dbl> <dbl>
1 1 0.0183 0.00914
2 1 0.0185 0.00905
3 1 0.0187 0.00897
4 1 0.0189 0.00888
5 1 0.0191 0.00880
6 1 0.0193 0.00872
7 1 0.0194 0.00864
8 1 0.0196 0.00855
9 1 0.0198 0.00847
10 1 0.0200 0.00839
The hrzn is the group, the parameter is a grid of parameter space, and the density is the density for the value in the parameter column.
I would like to generate summary the statistics percentiles 10 to 90 by 10 by hrzn. I am trying to keep this computationally efficient. I know I could sample the parameter with the density as weights, but I am curious is there is a faster way to generate the percentiles from the density without doing a sample.
The data may be obtained with the following
df <- readr::read_csv("https://raw.githubusercontent.com/alexhallam/density_data/master/data.csv")

When I load the data from your csv, each of the 5 groups have identical values for parameter and density:
df
#># A tibble: 3,500 x 3
#> hrzn parameter density
#> <int> <dbl> <dbl>
#> 1 1 0.0183 0.00914
#> 2 1 0.0185 0.00905
#> 3 1 0.0187 0.00897
#> 4 1 0.0189 0.00888
#> 5 1 0.0191 0.00880
#> 6 1 0.0193 0.00872
#> 7 1 0.0194 0.00864
#> 8 1 0.0196 0.00855
#> 9 1 0.0198 0.00847
#>10 1 0.0200 0.00839
#># ... with 3,490 more rows
sapply(1:5, function(x) all(df$parameter[df$hrzn == x] == df$parameter[df$hrzn == 1]))
# [1] TRUE TRUE TRUE TRUE TRUE
sapply(1:5, function(x) all(df$density[df$hrzn == x] == df$density[df$hrzn == 1]))
# [1] TRUE TRUE TRUE TRUE TRUE
I'm not sure if this is a mistake or not, but clearly if you're worried about computation, anything you want to do on all the groups can be done 5 times faster by only doing it on a single group.
Anyway, to get the 10th and 90th centiles for each hrzn, you just need to see which parameter is adjacent to 0.1 and 0.9 on the cumulative distribution function. Let's generalize to working it out for all the groups in case there's an issue with the data or you want to repeat it with different data:
library(dplyr)
df %>%
mutate(hrzn = factor(hrzn)) %>%
group_by(hrzn) %>%
summarise(centile_10 = parameter[which(cumsum(density) > .1)[1]],
centile_90 = parameter[which(cumsum(density) > .9)[1]] )
#># A tibble: 5 x 3
#> hrzn centile_10 centile_90
#> <fct> <dbl> <dbl>
#>1 1 0.0204 0.200
#>2 2 0.0204 0.200
#>3 3 0.0204 0.200
#>4 4 0.0204 0.200
#>5 5 0.0204 0.200
Of course, they're all the same for the reasons mentioned above.
If you're worried about computation time (even though the above only takes a few milliseconds), and you don't mind opaque code, you could take advantage of the ordering to cut the cumsum of your entire density column between 0 and 5 in steps of 0.1, to get all the 10th centiles, like this:
summary <- df[which((diff(as.numeric(cut(cumsum(df$density), seq(0,5,.1))) - 1) != 0)) + 1,]
summary <- summary[-(1:5)*10,]
summary$centile <- rep(1:9*10, 5)
summary
#> # A tibble: 45 x 4
#> hrzn parameter density centile
#> <int> <dbl> <dbl> <dbl>
#> 1 1 0.0204 0.00824 10
#> 2 1 0.0233 0.00729 20
#> 3 1 0.0271 0.00634 30
#> 4 1 0.0321 0.00542 40
#> 5 1 0.0392 0.00453 50
#> 6 1 0.0498 0.00366 60
#> 7 1 0.0679 0.00281 70
#> 8 1 0.103 0.00199 80
#> 9 1 0.200 0.00114 90
#> 10 2 0.0204 0.00824 10
#> # ... with 35 more rows
Perhaps I have misunderstood you and you are actually working in a 5-dimensional parameter space and want to know the parameter values at the 10th and 90th centiles of 5d density. In that case, you can take advantage of the fact that all groups are the same to calculate the 10th and 90th centiles for the 5-d density by simply taking the 5th root of these two centiles:
df %>%
mutate(hrzn = factor(hrzn)) %>%
group_by(hrzn) %>%
summarise(centile_10 = parameter[which(cumsum(density) > .1^.2)[1]],
centile_90 = parameter[which(cumsum(density) > .9^.2)[1]] )
#> # A tibble: 5 x 3
#> hrzn centile_10 centile_90
#> <fct> <dbl> <dbl>
#> 1 1 0.0545 0.664
#> 2 2 0.0545 0.664
#> 3 3 0.0545 0.664
#> 4 4 0.0545 0.664
#> 5 5 0.0545 0.664

Why am I getting 'train' and 'class' have different lengths"

Why am I getting -
'train' and 'class' have different lengths
In spite of having both of them with same lengths
y_pred=knn(train=training_set[,1:2],
test=Test_set[,-3],
cl=training_set[,3],
k=5)
Their lengths are given below-
> dim(training_set[,-3])
[1] 300 2
> dim(training_set[,3])
[1] 300 1
> head(training_set)
# A tibble: 6 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -1.77 -1.47 0
2 -1.10 -0.788 0
3 -1.00 -0.360 0
4 -1.00 0.382 0
5 -0.523 2.27 1
6 -0.236 -0.160 0
> Test_set
# A tibble: 100 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -0.304 -1.51 0
2 -1.06 -0.325 0
3 -1.82 0.286 0
4 -1.25 -1.10 0
5 -1.15 -0.485 0
6 0.641 -1.32 1
7 0.735 -1.26 1
8 0.924 -1.22 1
9 0.829 -0.582 1
10 -0.871 -0.774 0

It's because knn is expecting class to be a vector and you are giving it a data table with one column. The test knn is doing is whether nrow(train) == length(cl). If cl is a data table that does not give the answer you are expecting. Compare:
> length(data.frame(a=c(1,2,3)))
[1] 1
> length(c(1,2,3))
[1] 3
If you use cl=training_set$Purchased, which extracts the vector from the table, that should fix it.
This is specific gotcha if you are moving from data.frame to data.table because the default drop behaviour is different:
> dt <- data.table(a=1:3, b=4:6)
> dt[,2]
b
1: 4
2: 5
3: 6
> df <- data.frame(a=1:3, b=4:6)
> df[,2]
[1] 4 5 6
> df[,2, drop=FALSE]
b
1 4
2 5
3 6

How to rearrange summary statistic easily viewable in dplyr?

I have a summary statistic from my dataframe:
war_3 a1_1_area_mean a1_2_area_mean a1_3_area_mean a1_4_area_mean a1_5_area_mean a1_6_area_mean
1 1 0.23827851 0.07843460 0.02531607 0.1193928 0.7635068 0.02333938
2 2 0.23162416 0.05949285 0.01422585 0.3565457 0.8593997 0.06895526
3 3 0.09187454 0.07274503 0.10357251 0.2821142 0.5929178 0.02455053
a1_7_area_mean a1_8_area_mean a1_t_area_mean a2_1_area_mean a2_2_area_mean a2_3_area_mean
1 0.005387169 0.2725867 1.526242 0.107725394 0.19406917 0.02213419
2 0.016701786 0.2222106 1.829156 0.073991405 0.03504120 0.00815826
3 0.028382414 0.1997225 1.395880 0.003634443 0.03508602 0.00000000
a2_4_area_mean a2_5_area_mean a2_t_area_mean a1_1_area_var a1_2_area_var a1_3_area_var a1_4_area_var
1 0.02024704 0.0040841950 0.34826000 1.2730028 0.13048871 0.05165589 0.1851353
2 0.07621595 0.0005078053 0.19391462 0.6114136 0.09287735 0.05697542 0.7284144
3 0.00000000 0.0000000000 0.03872046 0.1171754 0.07581946 0.35349703 0.3883895
a1_5_area_var a1_6_area_var a1_7_area_var a1_8_area_var a1_t_area_var a2_1_area_var a2_2_area_var
1 2.7640424 0.01688505 0.001459156 0.8844626 7.940393 0.57992528 1.41104857
2 2.6797714 0.05490461 0.003428341 0.5725653 8.190389 0.18087732 0.11406984
3 0.9938991 0.01801805 0.006360622 0.3405592 3.460435 0.00306776 0.06579978
a2_3_area_var a2_4_area_var a2_5_area_var a2_t_area_var a1_1_area_sd a1_2_area_sd a1_3_area_sd
1 0.067049470 0.06260921 0.0045015472 2.10734089 1.1282743 0.3612322 0.2272793
2 0.009580693 0.29505206 0.0005616327 0.85060972 0.7819294 0.3047579 0.2386952
3 0.000000000 0.00000000 0.0000000000 0.06861217 0.3423089 0.2753533 0.5945562
a1_4_area_sd a1_5_area_sd a1_6_area_sd a1_7_area_sd a1_8_area_sd a1_t_area_sd a2_1_area_sd
1 0.4302735 1.6625410 0.1299425 0.03819890 0.9404587 2.817870 0.76152825
2 0.8534719 1.6370007 0.2343173 0.05855204 0.7566805 2.861886 0.42529674
3 0.6232090 0.9969449 0.1342313 0.07975351 0.5835745 1.860224 0.05538736
a2_2_area_sd a2_3_area_sd a2_4_area_sd a2_5_area_sd a2_t_area_sd
1 1.1878757 0.25893912 0.2502183 0.06709357 1.4516683
2 0.3377423 0.09788102 0.5431869 0.02369879 0.9222851
3 0.2565147 0.00000000 0.0000000 0.00000000 0.2619392
Above summary table is from following scripts and original data frame as below:
uid war_3 a1_1_area a1_2_area a1_3_area a1_4_area a1_5_area a1_6_area a1_7_area a1_8_area a1_t_area
1 1001 1 0 0.00000 0 0.67048 0.0000 0.02088 0 0.00000 0.69136
2 1002 2 0 0.00000 0 0.00000 0.9019 0.14493 0 0.00000 1.04683
3 1003 2 0 0.00000 0 0.00000 0.9019 0.00000 0 0.00000 0.90190
4 1004 2 0 1.09322 0 0.00000 0.0000 0.00000 0 0.00000 1.09322
5 1005 3 0 1.75000 0 0.00000 0.0000 0.00000 0 0.00000 1.75000
6 1006 2 0 2.43442 0 0.32223 0.0000 0.00000 0 0.76801 3.52466
a2_1_area a2_2_area a2_3_area a2_4_area a2_5_area a2_t_area
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
summary <- df.anov %>% select(-uid) %>% group_by(war_3,) %>%
summarize_each(funs(min,max,mean,median,var,sd)))
However, as it is difficult to compare each value in pairs of war_3 (group) by mean, var and sd, I would like to transform it into the following format:
variable war_3 mean variance s.d.
a1_1_area, 1 , x , x , x
a1_1_area, 2 , x , x , x
a1_1_area, 3 , x , x , x
a1_2_area, 1 , x , x , x
a1_2_area, 2 , x , x , x
a1_2_area, 3 , x , x , x
a1_3_area, 1 , x , x , x
a1_3_area, 2 , x , x , x
a1_3_area, 3 , x , x , x
a1_4_area, 1 , x , x , x
a1_4_area, 2 , x , x , x
a1_4_area, 3 , x , x , x
(it continues until `a2_5_area` in `variable`)
I used to use gather in dplyr to rearrange wide-format into long-format for simple dataframe, however this dataframe requires more complecated operation which may require repetitive select(matches()) or so.
variables are:
war_3 variable to group each record (it is already grouped by group_by(war_3) %>% summarize_each(funs(mean,var,sd)) in the previous operation)
aX_Y_area_Z: where X has two values as 1 and 2, Y spreads 1-8 for X=1 and 1-5 for X=2. Z has three statistics as mean, variance and s.d..
Could you help me to make it possible?
I prefer to use dplyr piping rather than data.table() solution.
Following scripts are very manual way but makes duplicated records in each gather()and I do not want to specify neither each column number nor name manually.
summary %>%
gather(key1,mean,
a1_1_area_mean,a1_2_area_mean,a1_3_area_mean,a1_4_area_mean,
a1_5_area_mean,a1_6_area_mean,a1_7_area_mean,a1_8_area_mean,
a1_t_area_mean,a2_1_area_mean,a2_2_area_mean,a2_3_area_mean,
a2_4_area_mean,a2_5_area_mean,a2_t_area_mean) %>%
gather(key2,var,
a1_1_area_var,a1_2_area_var,a1_3_area_var,a1_4_area_var,
a1_5_area_var,a1_6_area_var,a1_7_area_var,a1_8_area_var,
a1_t_area_var,a2_1_area_var,a2_2_area_var,a2_3_area_var,
a2_4_area_var,a2_5_area_var,a2_t_area_var) %>%
gather(key3,sd,
a1_1_area_sd,a1_2_area_sd,a1_3_area_sd,a1_4_area_sd,
a1_5_area_sd,a1_6_area_sd,a1_7_area_sd,a1_8_area_sd,
a1_t_area_sd,a2_1_area_sd,a2_2_area_sd,a2_3_area_sd,
a2_4_area_sd,a2_5_area_sd,a2_t_area_sd) %>%
mutate_at(vars(key1),funs(str_sub(.,1,9))) %>% select(-key2,-key3) %>%
rename(key=key1) -> summary2

Since you provided no easy to copy & paste sample data, I produced some by my own
library(tidyverse)
data <- mtcars %>%
group_by(cyl) %>%
mutate(disp_1 = disp, disp_2=disp, mpg_1 = mpg, mpg_2 = mpg, drat_1=drat, drat_2=drat) %>%
select(-disp, -mpg, -drat) %>%
summarise_at(vars(contains("mpg"),contains("disp"), contains("drat")), list(mean =mean, sd = sd))
data
# A tibble: 3 x 13
cyl mpg_1_mean mpg_2_mean disp_1_mean disp_2_mean drat_1_mean drat_2_mean mpg_1_sd
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 26.7 105. 105. 4.07 4.07 4.51
2 6 19.7 19.7 183. 183. 3.59 3.59 1.45
3 8 15.1 15.1 353. 353. 3.23 3.23 2.56
# ... with 5 more variables: mpg_2_sd <dbl>, disp_1_sd <dbl>, disp_2_sd <dbl>,
# drat_1_sd <dbl>, drat_2_sd <dbl>
then, simply gather, separate and spread
data %>%
gather(key, value, -cyl) %>%
separate(key, into = letters[1:3]) %>%
spread(c, value)
# A tibble: 18 x 5
cyl a b mean sd
<dbl> <chr> <chr> <dbl> <dbl>
1 4 disp 1 105. 26.9
2 4 disp 2 105. 26.9
3 4 drat 1 4.07 0.365
4 4 drat 2 4.07 0.365
5 4 mpg 1 26.7 4.51
6 4 mpg 2 26.7 4.51
7 6 disp 1 183. 41.6
8 6 disp 2 183. 41.6
9 6 drat 1 3.59 0.476
10 6 drat 2 3.59 0.476
11 6 mpg 1 19.7 1.45
12 6 mpg 2 19.7 1.45
13 8 disp 1 353. 67.8
14 8 disp 2 353. 67.8
15 8 drat 1 3.23 0.372
16 8 drat 2 3.23 0.372
17 8 mpg 1 15.1 2.56
18 8 mpg 2 15.1 2.56

Create Multiple 2-dimensional Tables from Multiple Columns in R Using dplyr

I'm looking for an efficient way to create multiple 2-dimension tables from an R dataframe of chi-square statistics. The code below builds on this answer to a previous question of mine about getting chi-square stats by groups. Now I want to create tables from the output by group. Here's what I have so far using the hsbdemo data frame from the UCLA R site:
ml <- foreign::read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
str(ml)
'data.frame': 200 obs. of 13 variables:
$ id : num 45 108 15 67 153 51 164 133 2 53 ...
$ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
$ ses : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
$ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
$ prog : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...
ml %>%
dplyr::select(prog, ses, schtyp) %>%
table() %>%
apply(3, chisq.test, simulate.p.value = TRUE) %>%
lapply(`[`, c(6,7,9)) %>%
reshape2::melt() %>%
tidyr::spread(key = L2, value = value) %>%
dplyr::rename(SchoolType = L1) %>%
dplyr::arrange(SchoolType, prog) %>%
dplyr::select(-observed, -expected) %>%
reshape2::acast(., prog ~ ses ~ SchoolType ) %>%
tbl_df()
The output after the last arrange statement produces this tibble (showing only the first five rows):
prog ses SchoolType expected observed stdres
1 general low private 0.37500 2 3.0404678
2 general middle private 3.56250 3 -0.5187244
3 general high private 2.06250 1 -1.0131777
4 academic low private 1.50000 0 -2.5298221
5 academic middle private 14.25000 14 -0.2078097
It's easy to select one column, for example, stdres, and pass it to acast and tbl_df, which gets pretty much what I'm after:
# A tibble: 3 x 6
low.private middle.private high.private low.public middle.public high.public
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.04 -0.519 -1.01 1.47 -0.236 -1.18
2 -2.53 -0.208 1.50 -0.940 -2.06 3.21
3 -0.377 1.21 -1.06 -0.331 2.50 -2.45
Now I can repeat these steps for observed and expected frequencies and bind them by rows, but that seems inefficient. The output would observed frequencies stacked on expected, stacked on the standardized residuals. Something like this:
low.private middle.private high.private low.public middle.public high.public
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 3 1 14 17 8
2 0 14 10 19 30 32
3 0 2 0 12 29 7
4 0.375 3.56 2.06 10.4 17.6 10.9
5 1.5 14.2 8.25 21.7 36.6 22.7
6 0.125 1.19 0.688 12.9 21.7 13.4
7 3.04 -0.519 -1.01 1.47 -0.236 -1.18
8 -2.53 -0.208 1.50 -0.940 -2.06 3.21
9 -0.377 1.21 -1.06 -0.331 2.50 -2.45
Seems there ought to be a way to do this without repeating the code for each column, probably by creating and processing a list. Thanks in advance.

Might this be the answer?
ml1 <- ml %>%
dplyr::select(prog, ses, schtyp) %>%
table() %>%
apply(3, chisq.test, simulate.p.value = TRUE) %>%
lapply(`[`, c(6,7,9)) %>%
reshape2::melt()
ml2 <- ml1 %>%
dplyr::mutate(type=paste(ses, L1, sep=".")) %>%
dplyr::select(-ses, -L1) %>%
tidyr::spread(type, value)
This gives you
prog L2 high.private high.public low.private low.public middle.private middle.public
1 general expected 2.062500 10.910714 0.3750000 10.4464286 3.5625000 17.6428571
2 general observed 1.000000 8.000000 2.0000000 14.0000000 3.0000000 17.0000000
3 general stdres -1.013178 -1.184936 3.0404678 1.4663681 -0.5187244 -0.2360209
4 academic expected 8.250000 22.660714 1.5000000 21.6964286 14.2500000 36.6428571
5 academic observed 10.000000 32.000000 0.0000000 19.0000000 14.0000000 30.0000000
6 academic stdres 1.504203 3.212431 -2.5298221 -0.9401386 -0.2078097 -2.0607058
7 vocation expected 0.687500 13.428571 0.1250000 12.8571429 1.1875000 21.7142857
8 vocation observed 0.000000 7.000000 0.0000000 12.0000000 2.0000000 29.0000000
9 vocation stdres -1.057100 -2.445826 -0.3771236 -0.3305575 1.2081594 2.4999085
I am not sure I understand completely what you are out after... But basically, create a new variable of SES and school type, and gather based on that. And obviously, reorder it as you wish :-)

Categories

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

how to do the cluster wise summary statistics dynamically - r

Related

How to mutate the slopes of lines

How to make grouped summary statistics based off of densities in R

Why am I getting 'train' and 'class' have different lengths"

How to rearrange summary statistic easily viewable in dplyr?

Create Multiple 2-dimensional Tables from Multiple Columns in R Using dplyr

Categories

Resources