I have a large data frame that I summarize using psych's describe() to create a new summary data frame:
df_sum <- describe(df[my_subset])
I check to see that df_sum has row names
has_rownames(df_sum)
[1] TRUE
Browse[2]> rownames(df_sum)
[1] "Q1" "Q2" "Q3" "Q4" "Q5"
[6] "Q6" "Q7" "Q8" "Q9" "Q10"
I now try to turn these row names into a new column:
Browse[2]> rownames_to_column(df_sum, var = "Test")
Error in Math.data.frame(list(Test = c("Q1", "Q2", "Q3", "Q4", "Q5", "Q6", :
non-numeric variable(s) in data frame: Test
However, if I use the deprecated function add_rownames, it works!
Browse[2]> add_rownames(df_sum, var = "Test")
# A tibble: 22 x 14
Test vars n mean sd median trimmed mad min max range skew kurtosis se
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Q1 1 963 5.22 2.53 5 5.29 2.97 0 10 10 -0.216 -0.615 0.0814
2 Q2 2 963 5.50 2.56 6 5.56 2.97 0 10 10 -0.240 -0.656 0.0826
3 Q3 3 963 4.82 2.72 5 4.83 2.97 0 10 10 -0.0509 -0.860 0.0878
4 Q4 4 963 4.76 3.03 5 4.73 2.97 0 10 10 -0.0102 -1.05 0.0976
5 Q5 5 963 5.07 3.10 5 5.08 4.45 0 10 10 -0.0366 -1.16 0.100
6 Q6 6 963 4.13 3.18 4 3.97 4.45 0 10 10 0.250 -1.16 0.103
7 Q7 7 963 4.89 3.14 5 4.86 4.45 0 10 10 0.0330 -1.19 0.101
8 Q8 8 963 1.83 2.71 0 1.29 0 0 10 10 1.41 0.862 0.0872
9 Q9 9 963 4.56 3.05 5 4.50 2.97 0 10 10 0.0499 -1.08 0.0982
10 Q10 10 963 4.11 2.98 4 3.95 2.97 0 10 10 0.327 -0.931 0.0962
What makes add_rownames work when rownames_to_column fails with that cryptic error message? What do I need to do to fix rownames_to_column?
Thanks in advance
Thomas Philips
I needed to add as.data.frame to my code.
df_sum <- describe(df[my_subset])
> class(df_sum)
[1] "psych" "describe" "data.frame"
If I apply rownames_to_column to df_sum, I get the error message I mentioned earlier. However, if I type
df_sum <- as.data.frame(describe(df[my_subset]))
> class(df_sum)
[2] "data.frame"
So if I first write
df_sum <- as.data.frame(describe(df[my_subset]))
and then apply rownames_to_column to df_sum, it works as expected. The likely explanation: rownames_to_column() preserves the extra "psych" and "describe" classes on its result, so when the result auto-prints, psych's print method tries to round every column, and rounding (a Math group generic) fails on the new character column Test — hence the Math.data.frame error. add_rownames() returned a plain tibble, which is why it worked; as.data.frame() strips the extra classes and makes rownames_to_column() behave too.
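Putting the fix together in one place (a minimal sketch; df and my_subset are the objects from the question):
library(psych)
library(tibble)
# Strip the extra "psych"/"describe" classes, then promote the row names
df_sum <- as.data.frame(describe(df[my_subset]))
df_sum <- rownames_to_column(df_sum, var = "Test")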
Related
Suppose we have the following data:
tib <- tibble::tibble(x = 1:10)
Then, suppose we want to make a function that takes a column as input and returns a tibble with several added columns such as:
library(dplyr)
generate_transformations <- function(data, column){
  transform <- sym(column)
  data %>%
    mutate(
      sqrt = sqrt(!!transform),
      recip = 1 / !!transform,
      log = log(!!transform)
    )
}
# Usage is great:
tib %>%
generate_transformations('x')
# A tibble: 10 x 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
Now my question is, is there a way to avoid unquoting (!!) transform repeatedly?
Yes, I could, e.g., temporarily rename column and then rename it back after I am done, but that is not my interest in this question.
I am interested if there is a way to produce a variable that does not need the !!.
While it does not work, I was looking for something like:
generate_transformations <- function(data, column){
  transform <- !!sym(column) # cannot unquote here :(
  data %>%
    mutate(
      sqrt = sqrt(transform),
      recip = 1 / transform,
      log = log(transform)
    )
}
Convert the input to a string, subset that column from the data, and use transform directly:
generate_transformations <- function(data, column){
  transform <- data[[rlang::as_string(ensym(column))]]
  data %>%
    mutate(
      sqrt = sqrt(transform),
      recip = 1 / transform,
      log = log(transform)
    )
}
Testing:
tib %>%
generate_transformations('x')
# A tibble: 10 × 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
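As an aside, because ensym() captures the expression rather than its value, this version of the function also accepts a bare column name (a quick check under the same setup):
tib %>% generate_transformations(x)  # same output as passing 'x'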
Or create a temporary column and remove it afterwards (setting transform = NULL inside mutate() drops it):
generate_transformations <- function(data, column){
  data %>%
    mutate(transform = !! rlang::ensym(column),
           sqrt = sqrt(transform),
           recip = 1 / transform,
           log = log(transform),
           transform = NULL
    )
}
Testing:
tib %>%
generate_transformations('x')
# A tibble: 10 × 4
x sqrt recip log
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1.41 0.5 0.693
3 3 1.73 0.333 1.10
4 4 2 0.25 1.39
5 5 2.24 0.2 1.61
6 6 2.45 0.167 1.79
7 7 2.65 0.143 1.95
8 8 2.83 0.125 2.08
9 9 3 0.111 2.20
10 10 3.16 0.1 2.30
You can do it in one step if you swap !! for {{ }} and use across():
data_transformations <- function(d, col, funs = list(sqrt = sqrt, log = log, recip = ~ 1 / .)) {
  d %>% mutate(across({{ col }}, .fns = funs))
}
tib %>% data_transformations(x)
# A tibble: 10 × 4
x x_sqrt x_log x_recip
<int> <dbl> <dbl> <dbl>
1 1 1 0 1
2 2 1.41 0.693 0.5
3 3 1.73 1.10 0.333
4 4 2 1.39 0.25
5 5 2.24 1.61 0.2
6 6 2.45 1.79 0.167
7 7 2.65 1.95 0.143
8 8 2.83 2.08 0.125
9 9 3 2.20 0.111
10 10 3.16 2.30 0.1
To name the new columns after the functions alone (sqrt, log, recip, as in your original function), use
data_transformations <- function(d, col, funs=list(sqrt=sqrt, log=log, recip=~1/.)) {
d %>% mutate(across({{col}}, .fns=funs, .names="{.fn}"))
}
tib %>% data_transformations(x)
# A tibble: 10 × 4
x sqrt log recip
<int> <dbl> <dbl> <dbl>
1 1 1 0 1
2 2 1.41 0.693 0.5
3 3 1.73 1.10 0.333
4 4 2 1.39 0.25
5 5 2.24 1.61 0.2
6 6 2.45 1.79 0.167
7 7 2.65 1.95 0.143
8 8 2.83 2.08 0.125
9 9 3 2.20 0.111
10 10 3.16 2.30 0.1
To handle multiple columns:
data_transformations <- function(d, cols, funs = list(sqrt = sqrt, log = log, recip = ~ 1 / .)) {
  d %>% mutate(across({{ cols }}, .fns = funs))
}
d1 <- tibble(x=1:10, y=seq(2, 20, 2))
d1 %>% data_transformations(c(x, y), list(sqrt=sqrt, log=log))
# A tibble: 10 × 6
x y x_sqrt x_log y_sqrt y_log
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 1.41 0.693
2 2 4 1.41 0.693 2 1.39
3 3 6 1.73 1.10 2.45 1.79
4 4 8 2 1.39 2.83 2.08
5 5 10 2.24 1.61 3.16 2.30
6 6 12 2.45 1.79 3.46 2.48
7 7 14 2.65 1.95 3.74 2.64
8 8 16 2.83 2.08 4 2.77
9 9 18 3 2.20 4.24 2.89
10 10 20 3.16 2.30 4.47 3.00
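If the column names arrive as strings, as in the original question, all_of() handles that inside across() with no unquoting. A sketch (the name data_transformations_chr is just for illustration):
data_transformations_chr <- function(d, cols, funs = list(sqrt = sqrt, log = log, recip = ~ 1 / .)) {
  d %>% mutate(across(all_of(cols), .fns = funs))
}
d1 %>% data_transformations_chr(c("x", "y"), list(sqrt = sqrt, log = log))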
I am trying to create a treatment dummy for the states whose legal1820 in 1970 differs from their legal1820 in 1979. So I need the proper syntax for something like this: treat = ifelse((legal1820 when (year == 1970)) != (legal1820 when (year == 1979)), 1, 0)
This is the data I am using:
mlda <- read_dta("http://masteringmetrics.com/wp-content/uploads/2015/01/deaths.dta")
dft <- mlda %>%
  filter(year <= 1990) %>%
  mutate(dtype = as_factor(dtype, levels = "labels"),
         age_cat = agegr,
         agegr = as_factor(agegr, levels = "labels"))
library(tidycensus)
data("fips_codes")
fips_codes <- fips_codes %>%
  mutate(state_code = as.numeric(state_code)) %>%
  select(state, state_code) %>%
  distinct()
dft <- dft %>%
  rename(state_code = state) %>%
  right_join(fips_codes, by = "state_code") %>%
  select(-state_code) %>%
  group_by(state) %>%
  filter(agegr == "18-20 yrs", year <= 1983) %>%
  pivot_wider(names_from = dtype, values_from = mrate) %>%
  mutate(post = ifelse(year >= 1975, 1, 0))
These are the libraries I am using (most of them are for other parts of my code):
library(tidyverse)
library(AER)
library(stargazer)
library(haven)
library(lfe)
library(estimatr)
library(stringr)
library(dplyr)
library(modelsummary)
library(ggplot2)
Is this what you are looking for?
library(dplyr)
mlda %>%
  group_by(state) %>%
  # TRUE when a state's 1970 value differs from its 1979 value;
  # the unary `+` converts the logical flag to a 0/1 integer
  mutate(treat = +(first(legal1820[year == 1970] != legal1820[year == 1979])))
Output
# A tibble: 24,786 x 16
# Groups: state [51]
year state legal1820 dtype agegr count pop age legal beertaxa beerpercap winepercap spiritpercap totpercap mrate treat
<dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1970 1 0 1 [all] 1 [15-17 yrs] 224 213574 16.0 0 1.37 0.600 0.0900 0.700 1.38 105. 1
2 1971 1 0 1 [all] 1 [15-17 yrs] 241 220026 16.0 0 1.32 0.660 0.0900 0.760 1.52 110. 1
3 1972 1 0 1 [all] 1 [15-17 yrs] 270 224877 16.0 0 1.28 0.740 0.0900 0.780 1.61 120. 1
4 1973 1 0 1 [all] 1 [15-17 yrs] 258 227256 16.0 0 1.20 0.790 0.100 0.790 1.69 114. 1
5 1974 1 0 1 [all] 1 [15-17 yrs] 224 229025 16.0 0 1.08 0.830 0.160 0.810 1.80 97.8 1
6 1975 1 0.294 1 [all] 1 [15-17 yrs] 207 229739 16.0 0 0.991 0.880 0.160 0.850 1.88 90.1 1
7 1976 1 0.665 1 [all] 1 [15-17 yrs] 231 230696 16.0 0 0.937 0.890 0.150 0.860 1.89 100. 1
8 1977 1 0.668 1 [all] 1 [15-17 yrs] 219 230086 16.0 0 0.880 0.990 0.130 0.840 1.96 95.2 1
9 1978 1 0.667 1 [all] 1 [15-17 yrs] 234 229519 16.0 0 0.817 0.980 0.120 0.880 1.97 102. 1
10 1979 1 0.668 1 [all] 1 [15-17 yrs] 176 227140 16.0 0 0.734 0.980 0.120 0.840 1.94 77.5 1
# ... with 24,776 more rows
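An equivalent, more explicit variant (a sketch of the same logic: compute one flag per state, then join it back on):
library(dplyr)
treat_flags <- mlda %>%
  group_by(state) %>%
  summarise(treat = as.integer(first(legal1820[year == 1970]) !=
                                 first(legal1820[year == 1979])))
mlda <- mlda %>% left_join(treat_flags, by = "state")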
I am trying to summarise my data with ddply, and I am looking for a way to summarise it while reflecting its reliability.
Here is a description of my data set.
BSTN ASTN BSEC ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 TFtime Ttime ID
1 1001 1003 69551 1703 1703 0 0 0 0 0 0 399 2933 35404
2 1001 1006 69664 1703 1703 0 0 0 0 0 0 399 2284 35405
3 1001 1701 66606 1703 1703 0 0 0 0 0 0 118 1750 35406
4 1001 1701 66600 1703 1703 0 0 0 0 0 0 118 1750 35406
5 1001 1701 66601 1703 1703 0 0 0 0 0 0 118 1750 35406
6 1001 1703 69434 0 0 0 0 0 0 0 0 0 1005 35407
What I would like as output is a summary of the values of Ttime and TFtime, grouped by the "ASTN"s and "BSTN"s.
For the mean values of "Ttime" and "TFtime" I would like to reflect the 95% confidence interval, i.e. calculate the means of "Ttime" and "TFtime" within the 95% boundary. How would I do this with ddply when there are multiple combinations of BSTN~ASTNs?
Below is the code I used and wish to revise:
Routetable <- ddply(A, c(.(BSTN), .(ASTN1), .(BSTN2), .(ASTN2), .(BSTN3), .(ASTN3), .(BSTN4), .(ASTN4), .(BSTN5), .(ASTN)),
                    summarise, count = length(BSTN), mean = mean(Ttime), TFtimemean = mean(TFtime))
Updated answer
I'm not sure, but I guess what you actually want is to filter out, within each group, all values larger or smaller than mean(x) +/- 2*sd(x). The following approach does that. On ggplot2's diamonds dataset it keeps about 97% of all values and just removes the extremes.
library(tidyverse)
diamonds %>%
  group_by(cut, color) %>%
  # per group, compute low/high bounds (mean -/+ 2 SD) for x, y and z
  mutate(across(c(x, y, z),
                list(low  = ~ mean(.x, na.rm = TRUE) - 2 * sd(.x, na.rm = TRUE),
                     high = ~ mean(.x, na.rm = TRUE) + 2 * sd(.x, na.rm = TRUE)))) %>%
  # keep only rows inside the bounds on all three variables
  filter(x >= x_low & x <= x_high,
         y >= y_low & y <= y_high,
         z >= z_low & z <= z_high)
#> # A tibble: 52,299 x 16
#> # Groups: cut, color [35]
#> carat cut color clarity depth table price x y z x_low x_high
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 3.51 6.92
#> 2 0.21 Prem~ E SI1 59.8 61 326 3.89 3.84 2.31 3.52 7.65
#> 3 0.290 Prem~ I VS2 62.4 58 334 4.2 4.23 2.63 3.86 9.12
#> 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 4.14 8.62
#> 5 0.24 Very~ I VVS1 62.3 57 336 3.95 3.98 2.47 3.92 8.62
#> 6 0.26 Very~ H SI1 61.9 55 337 4.07 4.11 2.53 3.66 8.30
#> 7 0.23 Very~ H VS1 59.4 61 338 4 4.05 2.39 3.66 8.30
#> 8 0.3 Good J SI1 64 55 339 4.25 4.28 2.73 4.14 8.62
#> 9 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 3.88 8.76
#> 10 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 3.88 8.76
#> # ... with 52,289 more rows, and 4 more variables: y_low <dbl>, y_high <dbl>,
#> # z_low <dbl>, z_high <dbl>
Created on 2020-06-23 by the reprex package (v0.3.0)
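In newer dplyr (>= 1.0.4) the same per-group filter can be written more compactly with if_all(), without materializing the bound columns; a sketch:
library(dplyr)
library(ggplot2)  # diamonds data
diamonds %>%
  group_by(cut, color) %>%
  # keep rows where each of x, y, z lies within mean -/+ 2 SD of its group
  filter(if_all(c(x, y, z),
                ~ abs(.x - mean(.x, na.rm = TRUE)) <= 2 * sd(.x, na.rm = TRUE))) %>%
  ungroup()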
Old answer
With better example data we could achieve a more programmatic approach. As an example I use ggplot2's diamonds dataset. See my comments in the code below.
library(tidyverse)
diamonds %>%
  # set up your groups
  nest_by(cut, color) %>%
  # define in `across` for which variables you want means and conf int to be calculated
  mutate(ttest = list(summarise(data, across(c(x, y, z),
                                             ~ broom::tidy(t.test(.x))))),
         ttest = list(unpack(ttest, c(x, y, z), names_sep = "_") %>%
                        # select only the estimates and conf intervals
                        select(contains("estimate"), contains("conf")))) %>%
  unnest(ttest)
#> # A tibble: 35 x 12
#> # Groups: cut, color [35]
#> cut color data x_estimate y_estimate z_estimate x_conf.low x_conf.high
#> <ord> <ord> <list<tb> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair D [163 × 8] 6.02 5.96 3.84 5.89 6.15
#> 2 Fair E [224 × 8] 5.91 5.86 3.72 5.80 6.02
#> 3 Fair F [312 × 8] 5.99 5.93 3.79 5.89 6.09
#> 4 Fair G [314 × 8] 6.17 6.11 3.96 6.06 6.28
#> 5 Fair H [303 × 8] 6.58 6.50 4.22 6.47 6.69
#> 6 Fair I [175 × 8] 6.56 6.49 4.19 6.43 6.70
#> 7 Fair J [119 × 8] 6.75 6.68 4.32 6.55 6.95
#> 8 Good D [662 × 8] 5.62 5.63 3.50 5.55 5.69
#> 9 Good E [933 × 8] 5.62 5.63 3.50 5.56 5.68
#> 10 Good F [909 × 8] 5.69 5.71 3.54 5.63 5.76
#> # … with 25 more rows, and 4 more variables: y_conf.low <dbl>,
#> # y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>
Created on 2020-06-19 by the reprex package (v0.3.0)
If you want to filter observations based on the confidence intervals of the means, you can adjust my approach above as follows. Note that this is not the same as filtering the top and bottom 2.5% of each variable; you will lose a lot of data.
library(tidyverse)
diamonds %>%
  nest_by(cut, color) %>%
  mutate(ttest = summarise(data, across(c(x, y, z),
                                        ~ broom::tidy(t.test(.x)))) %>%
           unpack(c(x, y, z), names_sep = "_")) %>%
  unpack(ttest) %>%
  select(cut, color, data, contains("estimate"), contains("conf")) %>%
  rowwise(cut, color) %>%
  mutate(data = list(filter(data,
                            x >= x_conf.low & x <= x_conf.high,
                            y >= y_conf.low & y <= y_conf.high,
                            z >= z_conf.low & z <= z_conf.high))) %>%
  unnest(data)
#> # A tibble: 322 x 19
#> # Groups: cut, color [30]
#> cut color carat clarity depth table price x y z x_estimate
#> <ord> <ord> <dbl> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair D 0.91 SI2 62.5 66 3079 6.08 6.01 3.78 6.02
#> 2 Fair D 0.9 SI2 65.7 60 3205 5.98 5.93 3.91 6.02
#> 3 Fair D 0.9 SI2 64.7 59 3205 6.09 5.99 3.91 6.02
#> 4 Fair D 0.95 SI2 64.4 60 3384 6.06 6.02 3.89 6.02
#> 5 Fair D 0.9 SI2 64.9 57 3473 6.03 5.98 3.9 6.02
#> 6 Fair D 0.9 SI2 64.5 61 3473 6.1 6 3.9 6.02
#> 7 Fair D 0.9 SI1 64.5 61 3689 6.05 6.01 3.89 6.02
#> 8 Fair D 0.91 SI1 64.7 61 3730 6.06 5.99 3.9 6.02
#> 9 Fair D 0.9 SI2 64.6 59 3847 6.04 6.01 3.89 6.02
#> 10 Fair D 0.91 SI1 64.4 60 3855 6.08 6.04 3.9 6.02
#> # ... with 312 more rows, and 8 more variables: y_estimate <dbl>,
#> # z_estimate <dbl>, x_conf.low <dbl>, x_conf.high <dbl>, y_conf.low <dbl>,
#> # y_conf.high <dbl>, z_conf.low <dbl>, z_conf.high <dbl>
Created on 2020-06-22 by the reprex package (v0.3.0)
Using package dplyr (which is more up-to-date than plyr) you can write as follows. "LB" and "UB" stand for "Lower Bound" and "Upper Bound" respectively.
library(dplyr)
A %>%
group_by(across(starts_with("BSTN") | starts_with("ASTN"))) %>%
summarise(
count = n(),
mean_Ttime = mean(Ttime),
mean_TFtime = mean(TFtime),
LB_Ttime = mean_Ttime - qnorm(0.975) * sd(Ttime) / sqrt(count),
UB_Ttime = mean_Ttime + qnorm(0.975) * sd(Ttime) / sqrt(count),
LB_TFtime = mean_TFtime - qnorm(0.975) * sd(TFtime) / sqrt(count),
UB_TFtime = mean_TFtime + qnorm(0.975) * sd(TFtime) / sqrt(count)
)
Output
# A tibble: 4 x 17
# Groups: BSTN, BSTN2, BSTN3, BSTN4, BSTN5, ASTN, ASTN1, ASTN2, ASTN3 [4]
# BSTN BSTN2 BSTN3 BSTN4 BSTN5 ASTN ASTN1 ASTN2 ASTN3 ASTN4 count mean_Ttime mean_TFtime LB_Ttime UB_Ttime LB_TFtime UB_TFtime
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1001 0 0 0 0 1703 0 0 0 0 1 1005 0 NA NA NA NA
# 2 1001 1703 0 0 0 1003 1703 0 0 0 1 2933 399 NA NA NA NA
# 3 1001 1703 0 0 0 1006 1703 0 0 0 1 2284 399 NA NA NA NA
# 4 1001 1703 0 0 0 1701 1703 0 0 0 3 1750 118 1750 1750 118 118
With this sample data we obtain several NAs because the group count is 1 in those cases (so sd() returns NA), but with larger datasets you will rarely see them.
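One caveat I would add: with group counts this small, a normal quantile understates the interval width; a t quantile with the group's own degrees of freedom is the usual fix. A sketch for the Ttime bounds only:
library(dplyr)
A %>%
  group_by(across(starts_with("BSTN") | starts_with("ASTN"))) %>%
  summarise(
    count = n(),
    mean_Ttime = mean(Ttime),
    # t-based 95% interval (needs count >= 2, otherwise sd() is NA)
    LB_Ttime = mean_Ttime - qt(0.975, count - 1) * sd(Ttime) / sqrt(count),
    UB_Ttime = mean_Ttime + qt(0.975, count - 1) * sd(Ttime) / sqrt(count)
  )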
I am struggling with some data munging. To get to the table below I used group_by and summarise_at to find the means of Q1-Q10 by cid and time (I started with multiple values for each cid at each time point), then filtered down to just the cids that appear at both time 1 and time 2. Using this (or going back to my raw data if there is a cleaner way), I want to count, for each cid, how many of the means of Q1-Q10 increased at time 2, and then, for each GROUP, find the mean number of increases.
GROUP cid time Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
A 169 1 4.45 4.09 3.91 3.73 3.82 4.27 3.55 4 4.55 3.91
A 169 2 4.56 4.15 4.06 3.94 4.09 4.53 3.91 3.97 4.12 4.21
A 184 1 4.64 4.18 3.45 3.64 3.82 4.55 3.91 4.27 4 3.55
A 184 2 3.9 3.6 3 3.6 3.4 3.9 3 3.5 3.2 3.1
B 277 1 4.43 4.21 3.64 4.36 4.36 4.57 4.36 4.29 4.07 4.07
B 277 2 4.11 4 3.56 3.44 3.67 4 3.89 3.78 3.44 3.89
...
I have seen examples using spread on iris data but this was for the difference on a single variable. Any help appreciated.
Try this. It gives you the mean increase by GROUP and Q:
df <- read.table(text = "GROUP cid time Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
A 169 1 4.45 4.09 3.91 3.73 3.82 4.27 3.55 4 4.55 3.91
A 169 2 4.56 4.15 4.06 3.94 4.09 4.53 3.91 3.97 4.12 4.21
A 184 1 4.64 4.18 3.45 3.64 3.82 4.55 3.91 4.27 4 3.55
A 184 2 3.9 3.6 3 3.6 3.4 3.9 3 3.5 3.2 3.1
B 277 1 4.43 4.21 3.64 4.36 4.36 4.57 4.36 4.29 4.07 4.07
B 277 2 4.11 4 3.56 3.44 3.67 4 3.89 3.78 3.44 3.89", header = TRUE)
library(dplyr)
library(tidyr)
df %>%
  # Convert to long
  pivot_longer(-c(GROUP, cid, time), names_to = "Q") %>%
  # Group by GROUP, cid, Q
  group_by(GROUP, cid, Q) %>%
  # Just in case: sort by time
  arrange(time) %>%
  # Increased at time 2, using lag
  mutate(is_increase = value > lag(value)) %>%
  # Mean increase by GROUP and Q
  group_by(GROUP, Q) %>%
  summarise(mean_inc = mean(is_increase, na.rm = TRUE))
#> # A tibble: 20 x 3
#> # Groups: GROUP [2]
#> GROUP Q mean_inc
#> <fct> <chr> <dbl>
#> 1 A Q1 0.5
#> 2 A Q10 0.5
#> 3 A Q2 0.5
#> 4 A Q3 0.5
#> 5 A Q4 0.5
#> 6 A Q5 0.5
#> 7 A Q6 0.5
#> 8 A Q7 0.5
#> 9 A Q8 0
#> 10 A Q9 0
#> 11 B Q1 0
#> 12 B Q10 0
#> 13 B Q2 0
#> 14 B Q3 0
#> 15 B Q4 0
#> 16 B Q5 0
#> 17 B Q6 0
#> 18 B Q7 0
#> 19 B Q8 0
#> 20 B Q9 0
Created on 2020-04-12 by the reprex package (v0.3.0)
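The question also asks for the mean number of increases per GROUP, counting increases per cid first. A variant of the same pipeline gets there (a sketch, assuming exactly two time points per cid):
df %>%
  pivot_longer(-c(GROUP, cid, time), names_to = "Q") %>%
  arrange(time) %>%
  group_by(GROUP, cid, Q) %>%
  # TRUE if the time-2 mean exceeds the time-1 mean
  summarise(increased = last(value) > first(value), .groups = "drop_last") %>%
  # number of Qs that increased, per cid
  summarise(n_increased = sum(increased), .groups = "drop_last") %>%
  # mean number of increases per GROUP
  summarise(mean_increases = mean(n_increased))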
I have a large dataset ("bsa", drawn from a 23-year period) which includes a variable ("leftrigh") for "left-right" views (political orientation). I'd like to summarise how the cohorts change over time. For example, in 1994 the average value of this scale for people aged 45 was (say) 2.6; in 1995 the average value of this scale for people aged 46 was (say) 2.7 -- etc etc. I've created a year-of-birth variable ("yrbrn") to facilitate this.
I've successfully created the means:
bsa <- bsa %>% group_by(yrbrn, syear) %>% mutate(meanlr = mean(leftrigh))
Where I'm struggling is to summarise the means by year (of the survey) and age (at the time of the survey). If I could create an array (containing these means) organised by age x survey-year, I could see the change over time by inspecting the diagonals. But I have no clue how to do this -- my skills are very limited...
A tibble: 66,744 x 10
Groups: yrbrn [104]
Rsex Rage leftrigh OldWt syear yrbrn coh per agecat meanlr
1 1 [Male] 40 1 [left] 1.12 2017 1977 17 2017 [37,47) 2.61
2 2 [Female] 79 1.8 0.562 2017 1938 9 2017 [77,87) 2.50
3 2 [Female] 50 1.5 1.69 2017 1967 15 2017 [47,57) 2.59
4 1 [Male] 73 2 0.562 2017 1944 10 2017 [67,77) 2.57
5 2 [Female] 31 3 0.562 2017 1986 19 2017 [27,37) 2.56
6 1 [Male] 74 2.2 0.562 2017 1943 10 2017 [67,77) 2.50
7 2 [Female] 58 2 0.562 2017 1959 13 2017 [57,67) 2.56
8 1 [Male] 59 1.2 0.562 2017 1958 13 2017 [57,67) 2.53
9 2 [Female] 19 4 1.69 2017 1998 21 2017 [17,27) 2.46
Possible format for presenting this information to see change over time:
1994 1995 1996 1997 1998 1999 2000
18
19
20
21
22
23
24
25
etc.
You can group_by both birth year and survey year at the same time:
# Setup (& make reproducible data...)
n <- 10000
df1 <- data.frame(
  yrbrn    = sample(1920:1995, size = n, replace = TRUE),
  Syear    = sample(2005:2015, size = n, replace = TRUE),
  leftrigh = sample(seq(0, 5, 0.1), size = n, replace = TRUE))
# Solution (dplyr for the grouping, tidyr for spread)
library(dplyr)
library(tidyr)
df1 %>%
  group_by(yrbrn, Syear) %>%
  summarise(meanLR = mean(leftrigh)) %>%
  spread(Syear, meanLR)
Produces the following:
# A tibble: 76 x 12
# Groups: yrbrn [76]
yrbrn `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1920 3.41 1.68 2.26 2.66 3.21 2.59 2.24 2.39 2.41 2.55 3.28
2 1921 2.43 2.71 2.74 2.32 2.24 1.89 2.85 3.27 2.53 1.82 2.65
3 1922 2.28 3.02 1.39 2.33 3.25 2.09 2.35 1.83 2.09 2.57 1.95
4 1923 3.53 3.72 2.87 2.05 2.94 1.99 2.8 2.88 2.62 3.14 2.28
5 1924 1.77 2.17 2.71 2.18 2.71 2.34 2.29 1.94 2.7 2.1 1.87
6 1925 1.83 3.01 2.48 2.54 2.74 2.11 2.35 2.65 2.57 1.82 2.39
7 1926 2.43 3.2 2.53 2.64 2.12 2.71 1.49 2.28 2.4 2.73 2.18
8 1927 1.33 2.83 2.26 2.82 2.34 2.09 2.3 2.66 3.09 2.2 2.27
9 1928 2.34 2.02 2.1 2.88 2.14 2.44 2.58 1.67 2.57 3.11 2.93
10 1929 2.31 2.29 2.93 2.08 2.11 2.47 2.39 1.76 3.09 3 2.9
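In current tidyr, spread() is superseded by pivot_wider(); the equivalent call would be (a sketch):
df1 %>%
  group_by(yrbrn, Syear) %>%
  summarise(meanLR = mean(leftrigh)) %>%
  tidyr::pivot_wider(names_from = Syear, values_from = meanLR)
Reading down the diagonals of the resulting table then traces each birth cohort across survey years, which is the comparison you are after.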