Top 3 Box without TRUE FALSE in R / Rstudio - r

I'm new to R. I'm able to create top 3 and bottom 3 boxes in my tables, but the rows display as "TRUE" and "FALSE".
The code that I used is:
library(expss)
X4607 %>%
tab_cells(qcs1a_SQ001, "Top 3 Box"=qcs1a_SQ001>7 & qcs1a_SQ001<11, "Bottom 3 Box"=qcs1a_SQ001<=2) %>%
tab_cols(total(), spcode) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot()
Is there any way to show only the number of 'TRUE' cases under the "Top 3 Box" label and get rid of the "TRUE" and "FALSE" rows?

There is a special function subtotal for that:
library(expss)
set.seed(123)
N = 100
X4607 = data.frame(
spcode = sample(c("South", "North"), size = N, replace = TRUE),
qcs1a_SQ001 = sample(c(1:10, 99), size = N, replace = TRUE)
)
X4607 %>%
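# Note: 7:10 covers four codes (7, 8, 9, 10); use 8:10 for a strict top 3 box matching the question's > 7 & < 11
tab_cells(subtotal(qcs1a_SQ001, "Bottom 3 Box" = 1:3, "Top 3 Box" = 7:10, position = "bottom")) %>%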
tab_cols(total(), spcode) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot()
# | | #Total | spcode | |
# | | | North | South |
# | | | A | B |
# | ------------ | ------ | ------ | ------ |
# | 1 | 5.0 | 9.3 | 1.8 |
# | 2 | 5.0 | 4.7 | 5.3 |
# | 3 | 9.0 | 4.7 | 12.3 |
# | 4 | 11.0 | 7.0 | 14.0 |
# | 5 | 6.0 | 9.3 | 3.5 |
# | 6 | 10.0 | 2.3 | 15.8 A |
# | 7 | 13.0 | 16.3 | 10.5 |
# | 8 | 14.0 | 16.3 | 12.3 |
# | 9 | 10.0 | 11.6 | 8.8 |
# | 10 | 10.0 | 9.3 | 10.5 |
# | 99 | 7.0 | 9.3 | 5.3 |
# | Bottom 3 Box | 19.0 | 18.6 | 19.3 |
# | Top 3 Box | 47.0 | 53.5 | 42.1 |
# | #Total cases | 100 | 43 | 57 |
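As a cross-check outside expss, the box percentages can also be computed in base R. This is only a minimal sketch: the helper name box_pct is hypothetical, and 8:10 is assumed for a strict top 3 box (the expss answer above uses 7:10). Code 99 stays in the base, as it does in the column-percentage table.

```r
set.seed(123)
N <- 100
X4607 <- data.frame(
  spcode = sample(c("South", "North"), size = N, replace = TRUE),
  qcs1a_SQ001 = sample(c(1:10, 99), size = N, replace = TRUE)
)

# Hypothetical helper: percentage of cases whose answer falls in `codes`,
# for the whole sample and for each level of `by`
box_pct <- function(x, codes, by) {
  hit <- x %in% codes
  c(Total = mean(hit) * 100, tapply(hit, by, mean) * 100)
}

boxes <- rbind(
  "Bottom 3 Box" = box_pct(X4607$qcs1a_SQ001, 1:3, X4607$spcode),
  "Top 3 Box"    = box_pct(X4607$qcs1a_SQ001, 8:10, X4607$spcode)
)
round(boxes, 1)
```

Each row of boxes is the mean of a logical vector, which is why no separate TRUE and FALSE rows appear.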

Related

How can I conditionally expand rows in my R dataframe?

I have a dataframe that I would like to expand based on a few conditions. If the Activity is "Repetitive" I would like to explode the rows to twice as long as the duration, filling in a new dataframe with a row for each 0.5 second event. The rest of the information would stay the same, except that the rows that have been expanded will alternate between the given object in the original dataframe (e.g. "Toy") and "Nothing."
Location <- c("Kitchen", "Living Room", "Living Room", "Garage")
Object <- c("Food", "Toy", "Clothes", "Floor")
Duration <- c(6,3,2,5)
CumDuration <- c(6,9,11,16)
Activity <- c("Repetitive", "Constant", "Constant", "Repetitive")
df <- data.frame(Location, Object, Duration, CumDuration, Activity)
So it looks like this:
| Location | Object | Duration | CumDuration | Activity |
| ----------- | -------- | -------- | ----------- | ---------- |
| Kitchen | Food | 6 | 6 | Repetitive |
| Living Room | Toy | 3 | 9 | Constant |
| Living Room | Clothes | 2 | 11 | Constant |
| Garage | Floor | 5 | 16 | Repetitive |
And I want it to look like this:
| Location | Object | Duration | CumDuration | Activity |
| ----------- | -------- | -------- | ----------- | ---------- |
| Kitchen | Food | 0.5 | 0.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 1 | Repetitive |
| Kitchen | Food | 0.5 | 1.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 2 | Repetitive |
| Kitchen | Food | 0.5 | 2.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 3 | Repetitive |
| Kitchen | Food | 0.5 | 3.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 4 | Repetitive |
| Kitchen | Food | 0.5 | 4.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 5 | Repetitive |
| Kitchen | Food | 0.5 | 5.5 | Repetitive |
| Kitchen | Nothing | 0.5 | 6 | Repetitive |
| Living Room | Toy | 3 | 9 | Constant |
| Living Room | Clothes | 2 | 11 | Constant |
| Garage | Floor | 0.5 | 11.5 | Repetitive |
| Garage | Nothing | 0.5 | 12 | Repetitive |
| Garage | Floor | 0.5 | 12.5 | Repetitive |
| Garage | Nothing | 0.5 | 13 | Repetitive |
| Garage | Floor | 0.5 | 13.5 | Repetitive |
| Garage | Nothing | 0.5 | 14 | Repetitive |
| Garage | Floor | 0.5 | 14.5 | Repetitive |
| Garage | Nothing | 0.5 | 15 | Repetitive |
| Garage | Floor | 0.5 | 15.5 | Repetitive |
| Garage | Nothing | 0.5 | 16 | Repetitive |
Thanks so much in advance!
Here is a dplyr option to achieve this:
library(dplyr)
df$CumDuration = as.numeric(df$CumDuration)
df %>% filter(Activity == "Repetitive") %>%
group_by(Location) %>%
slice(rep(1:n(), each= Duration/0.5)) %>% # Create the new rows
mutate(Duration = Duration/(Duration*2)) %>% # Change the Duration to 0.5
ungroup() %>%
arrange(CumDuration) %>%
mutate(Object = ifelse((row_number() %% 2) == 0, "Nothing", Object), ID = 1:n()) %>% # Change the Object every other row for "Nothing" and add ID for sorting in correct order
full_join(filter(df, Activity != "Repetitive")) %>% # Merge back with the unmodified rows of original data frame
arrange(CumDuration, ID) %>% # Arrange rows in the correct order
mutate(CumDuration = cumsum(Duration)) %>% # Recalculate the cumulative sum
select(-ID) # Remove the ID column no longer wanted
# A tibble: 24 x 5
Location Object Duration CumDuration Activity
<chr> <chr> <dbl> <dbl> <chr>
1 Kitchen Food 0.5 0.5 Repetitive
2 Kitchen Nothing 0.5 1 Repetitive
3 Kitchen Food 0.5 1.5 Repetitive
4 Kitchen Nothing 0.5 2 Repetitive
5 Kitchen Food 0.5 2.5 Repetitive
6 Kitchen Nothing 0.5 3 Repetitive
7 Kitchen Food 0.5 3.5 Repetitive
8 Kitchen Nothing 0.5 4 Repetitive
9 Kitchen Food 0.5 4.5 Repetitive
10 Kitchen Nothing 0.5 5 Repetitive
# ... with 14 more rows
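An alternative sketch, assuming tidyr is available, expands the rows with uncount() and alternates the object within each original row rather than across the whole table. The helper columns row_id and n_rep are introduced here for illustration only, and the approach assumes that twice the duration of every "Repetitive" row is a whole number.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Location = c("Kitchen", "Living Room", "Living Room", "Garage"),
  Object = c("Food", "Toy", "Clothes", "Floor"),
  Duration = c(6, 3, 2, 5),
  CumDuration = c(6, 9, 11, 16),
  Activity = c("Repetitive", "Constant", "Constant", "Repetitive"),
  stringsAsFactors = FALSE
)

expanded <- df %>%
  mutate(row_id = row_number(),                                     # remember the original row
         n_rep = ifelse(Activity == "Repetitive", Duration * 2, 1)) %>%
  uncount(n_rep) %>%                                                # repeat each row n_rep times
  group_by(row_id) %>%
  mutate(
    # every second copy of a repetitive row becomes "Nothing"
    Object = ifelse(Activity == "Repetitive" & row_number() %% 2 == 0, "Nothing", Object),
    Duration = ifelse(Activity == "Repetitive", 0.5, Duration)
  ) %>%
  ungroup() %>%
  mutate(CumDuration = cumsum(Duration)) %>%                        # recompute the running total
  select(-row_id)
```

Because the alternation restarts for every original row, each repetitive block begins with its own object even when the previous block has an odd number of half-second events.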

total() in tab_cols only sum up to one, any suggestion?

Suppose I have dataframe 'y'
WR<-c("S",'J',"T")
B<-c("b1","b2","b3")
wgt<-c(0.3,2,3)
y<-data.frame(WR,B,wgt)
I want to make a column-percentage crosstab with B as rows and WR plus a total of WR as columns, using the expss functions
library(expss)
y %>% tab_cols(total(),WR) %>% # Columns
tab_stat_valid_n("Base") %>%
tab_weight(wgt) %>%
tab_stat_valid_n("Projection") %>%
tab_cells(mrset(B))%>% # Row
tab_stat_cpct(total_row_position = "none") %>%
tab_pivot()
Result:
But the Base value in the #Total column does not match up: it shows 1 instead of 3
# #Total WR|J WR|S WR|T
# Base 1.000000 1 1.0 1
# Projection 5.300000 2 0.3 3
# b1 5.660377 NA 100.0 NA
# b2 37.735849 100 NA NA
# b3 56.603774 NA NA 100
I think I found the solution: call tab_cells() before the tab_stat_* functions
y %>% tab_cols(total(),WR) %>% # Columns
tab_cells(mrset(B))%>% # Row
tab_stat_valid_n("Base") %>%
tab_weight(wgt) %>%
tab_stat_valid_n("Projection") %>%
tab_stat_cpct(total_row_position = "none") %>%
tab_pivot()
| | | #Total | WR | | |
| | | | J | S | T |
| -- | ---------- | ------ | --- | ----- | --- |
| B | Base | 3.0 | 1 | 1.0 | 1 |
| | Projection | 5.3 | 2 | 0.3 | 3 |
| b1 | | 5.7 | | 100.0 | |
| b2 | | 37.7 | 100 | | |
| b3 | | 56.6 | | | 100 |
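The figures in the #Total column can be verified by hand in base R. This is a minimal sketch, not part of the expss solution; the variable names n_base, projection and total_cpct are illustrative.

```r
y <- data.frame(
  WR = c("S", "J", "T"),
  B = c("b1", "b2", "b3"),
  wgt = c(0.3, 2, 3)
)

n_base <- nrow(y)          # unweighted number of cases
projection <- sum(y$wgt)   # weighted number of cases

# weighted column percentage of each B category in the total column
total_cpct <- tapply(y$wgt, y$B, sum) / projection * 100
round(total_cpct, 1)
```

The unweighted base is 3 and the weighted projection is 5.3, matching the corrected table.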

Using groups as table header in kableExtra

I recently discovered kableExtra. Making tables in R instead of manually entering values in a Word-table is much faster and less prone to error.
I want to make tables that have a (or several) grouping variable(s) as a header.
Basically, instead of this
iris %>%
group_by(Species) %>%
summarise(mean = mean(Sepal.Length), sd = sd(Sepal.Length)) %>%
kbl(digits = 1,format = "pipe")
|Species | mean| sd|
|:----------|----:|---:|
|setosa | 5.0| 0.4|
|versicolor | 5.9| 0.5|
|virginica | 6.6| 0.6|
I want to accomplish this, or a variation of it.
| | Setosa | Versicolor | Virginica | |
|------|--------|------------|-----------|---|
| mean | 5.0 | 5.9 | 6.6 | |
| sd | 0.4 | 0.5 | 0.6 | |
| | | | | |
For multiple headers, I was thinking of something along the lines of this
iris %>%
mutate(long = ifelse(Sepal.Length > 5,TRUE,FALSE)) %>%
group_by(Species,long) %>%
summarise(mean = mean(Sepal.Length), sd = sd(Sepal.Length)) %>%
kbl(digits = 1)
|Species |long | mean| sd|
|:----------|:-----|----:|---:|
|setosa |FALSE | 4.8| 0.2|
|setosa |TRUE | 5.3| 0.2|
|versicolor |FALSE | 5.0| 0.1|
|versicolor |TRUE | 6.0| 0.5|
|virginica |FALSE | 4.9| NA|
|virginica |TRUE | 6.6| 0.6|
But instead producing
| | Setosa | Setosa | Versicolor | Versicolor | Virginica | Virginica | |
|------|--------|--------|------------|------------|-----------|-----------|---|
| long | TRUE | FALSE | TRUE | FALSE | TRUE | FALSE | |
| mean | 5.3 | 4.8 | 6.0 | 5.0 | 6.6 | 4.9 | |
| sd | 0.2 | 0.2 | 0.5 | 0.1 | 0.6 | NA | |
| | | | | | | | |
Bonus points for not repeating the table header, but having a merged cell.
Can anyone point me to any examples or relevant documentation?
I am using R 4.0.2, and as such only have access to kableExtra, not kable.
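One possible approach for the single-header case, sketched under the assumption that tidyr is available, is to reshape the summary so that the statistics become rows and the species become columns before handing it to kbl(). The names wide, stat and value are illustrative.

```r
library(dplyr)
library(tidyr)
library(kableExtra)

wide <- iris %>%
  group_by(Species) %>%
  summarise(mean = mean(Sepal.Length), sd = sd(Sepal.Length)) %>%
  pivot_longer(c(mean, sd), names_to = "stat") %>%        # one row per species and statistic
  pivot_wider(names_from = Species, values_from = value)  # one column per species

kbl(wide, digits = 1, format = "pipe")
```

For merged header cells spanning several columns, kableExtra's add_header_above() may be worth a look; it applies to HTML and LaTeX output rather than the pipe format.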

How to loop over a list of variables in a dataframe, and for each variable apply a filter and get a weighted frequency table?

I have the following example survey dataset:
df <- data.frame(sex = c(1, 1, 2, 2, 1, 2, 2, 2, 1, 2),
age = c(15, 40, 97, 25, 99, 65, 20, 99, 39, 48),
nationality= c(1, 3, 1, 2, 4, 97, 2, 2, 2, 99),
employment = c(2, 1, 99, 1, 1, 1, 1, 1, 2, 2),
income = c(-1, 2500, 999997, 10000, 65000, 999998, 999999, 15000, -1, -1),
weight = c(100, 20, 400, 300, 50, 50, 80, 250, 100, 100))
The following list contains selected variables that I want to use in a for loop:
list <- list(age = df$age, employment = df$employment, income = df$income)
I want to loop over the list of selected variables in the dataframe, and for each variable in the list apply a filter (condition) and get a weighted frequency table from the filtered data. In pseudocode this is what I want to do:
for i in list {
filter(i >= 1 & i <= max(i)-2 %>%
weighted frequency of var i based on 'weight'
}
I have tried many ways to do this in R but I still can’t figure out how. The last time I used this:
library(dplyr)
library(expss)
for (i in list){
filter(i > 1 & i < max(i))-2 %>%
fre(i, weight = df$weight)
}
But I get this error message:
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "logical"
I need to figure out how to do it because I need to loop over a list of 256 variables.
The results must be:
library(dplyr)
library(expss)
age:
F <- df %>% filter(age >= 1 & age < 97)
fre(F$age, weight = F$weight)
| F$age | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| ------ | ----- | ------------- | ------- | ------------ | ----------------------- |
| 15 | 100 | 13.3 | 13.3 | 13.3 | 13.3 |
| 20 | 80 | 10.7 | 10.7 | 10.7 | 24.0 |
| 25 | 300 | 40.0 | 40.0 | 40.0 | 64.0 |
| 39 | 100 | 13.3 | 13.3 | 13.3 | 77.3 |
| 40 | 20 | 2.7 | 2.7 | 2.7 | 80.0 |
| 48 | 100 | 13.3 | 13.3 | 13.3 | 93.3 |
| 65 | 50 | 6.7 | 6.7 | 6.7 | 100.0 |
| #Total | 750 | 100.0 | 100.0 | 100.0 | |
| <NA> | 0 | | 0.0 | | |
employment:
F <- df %>% filter(employment >= 1 & employment < 97)
fre(F$employment, weight = F$weight)
| F$employment | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| ------------ | ----- | ------------- | ------- | ------------ | ----------------------- |
| 1 | 750 | 71.4 | 71.4 | 71.4 | 71.4 |
| 2 | 300 | 28.6 | 28.6 | 28.6 | 100.0 |
| #Total | 1050 | 100.0 | 100.0 | 100.0 | |
| <NA> | 0 | | 0.0 | | |
income:
F <- df %>% filter(income >= 1 & income < 999997)
fre(F$income, weight = F$weight)
| F$income | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| -------- | ----- | ------------- | ------- | ------------ | ----------------------- |
| 2500 | 20 | 3.2 | 3.2 | 3.2 | 3.2 |
| 10000 | 300 | 48.4 | 48.4 | 48.4 | 51.6 |
| 15000 | 250 | 40.3 | 40.3 | 40.3 | 91.9 |
| 65000 | 50 | 8.1 | 8.1 | 8.1 | 100.0 |
| #Total | 620 | 100.0 | 100.0 | 100.0 | |
| <NA> | 0 | | 0.0 | | |
The lapply version.
library(dplyr)
library(expss)
vars_to_run <- c('age' , 'employment', 'income')
lapply(setNames(vars_to_run, vars_to_run), function(x){
z <- sym(x)
df_filter <- df %>%
filter(!!z >= 1 & !!z <= max(!!z) - 2 )
fre(df_filter[, x], weight = df_filter[, 'weight'])
})
Or using base R and [.
lapply(setNames(vars_to_run, vars_to_run), function(x){
df_filter <- df[df[x] >= 1 & df[x] <= max(df[x]) - 2, ]
expss::fre(df_filter[x], weight = df_filter[ , 'weight'])
})
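Both versions return a named list. Because the filter uses <= max(x) - 2, as in the question's pseudocode, the codes 97 and 999997 are kept: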
$age
| df_filter[x] | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| ------------ | ----- | ------------- | ------- | ------------ | ----------------------- |
| 15 | 100 | 8.7 | 8.7 | 8.7 | 8.7 |
| 20 | 80 | 7.0 | 7.0 | 7.0 | 15.7 |
| 25 | 300 | 26.1 | 26.1 | 26.1 | 41.7 |
| 39 | 100 | 8.7 | 8.7 | 8.7 | 50.4 |
| 40 | 20 | 1.7 | 1.7 | 1.7 | 52.2 |
| 48 | 100 | 8.7 | 8.7 | 8.7 | 60.9 |
| 65 | 50 | 4.3 | 4.3 | 4.3 | 65.2 |
| 97 | 400 | 34.8 | 34.8 | 34.8 | 100.0 |
| #Total | 1150 | 100.0 | 100.0 | 100.0 | |
| <NA> | 0 | | 0.0 | | |
$employment
| df_filter[x] | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| ------------ | ----- | ------------- | ------- | ------------ | ----------------------- |
| 1 | 750 | 71.4 | 71.4 | 71.4 | 71.4 |
| 2 | 300 | 28.6 | 28.6 | 28.6 | 100.0 |
| #Total | 1050 | 100.0 | 100.0 | 100.0 | |
| <NA> | 0 | | 0.0 | | |
$income
| df_filter[x] | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| ------------ | ----- | ------------- | ------- | ------------ | ----------------------- |
| 2500 | 20 | 2.0 | 2.0 | 2.0 | 2.0 |
| 10000 | 300 | 29.4 | 29.4 | 29.4 | 31.4 |
| 15000 | 250 | 24.5 | 24.5 | 24.5 | 55.9 |
| 65000 | 50 | 4.9 | 4.9 | 4.9 | 60.8 |
| 999997 | 400 | 39.2 | 39.2 | 39.2 | 100.0 |
| #Total | 1020 | 100.0 | 100.0 | 100.0 | |
| <NA> | 0 | | 0.0 | | |
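The tables in the question drop the codes 97 and 999997, which corresponds to a strict < max(x) - 2 rather than <=. A base-R sketch of that variant is below. It uses tapply() for the weighted counts instead of fre(), and the name weighted_counts is illustrative.

```r
df <- data.frame(sex = c(1, 1, 2, 2, 1, 2, 2, 2, 1, 2),
                 age = c(15, 40, 97, 25, 99, 65, 20, 99, 39, 48),
                 nationality = c(1, 3, 1, 2, 4, 97, 2, 2, 2, 99),
                 employment = c(2, 1, 99, 1, 1, 1, 1, 1, 2, 2),
                 income = c(-1, 2500, 999997, 10000, 65000, 999998, 999999, 15000, -1, -1),
                 weight = c(100, 20, 400, 300, 50, 50, 80, 250, 100, 100))

vars_to_run <- c("age", "employment", "income")

weighted_counts <- lapply(setNames(vars_to_run, vars_to_run), function(v) {
  x <- df[[v]]
  keep <- x >= 1 & x < max(x) - 2         # strict "<" also drops max(x) - 2
  tapply(df$weight[keep], x[keep], sum)   # sum of weights per remaining value
})

weighted_counts$employment
```

The weighted totals per variable (750, 1050 and 620) agree with the #Total rows the question asks for. Passing the filtered column and weights to fre() instead of tapply() should give the full frequency table.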
Let's see here. You'll need to update your for loop syntax. I added parentheses and converted the index to a number. You should also initialize the object you want the loop to fill up.
Next, you'll need to convert the columns to their names and filter using rlang syntax that converts 'age' to age (standard evaluation to non-standard evaluation).
library(dplyr)
library(rlang)
library(expss) # provides fre()
df <- data.frame(sex = c(1, 1, 2, 2, 1, 2, 2, 2, 1, 2),
age = c(15, 40, 97, 25, 99, 65, 20, 99, 39, 48),
nationality= c(1, 3, 1, 2, 4, 97, 2, 2, 2, 99),
employment = c(2, 1, 99, 1, 1, 1, 1, 1, 2, 2),
income = c(-1, 2500, 999997, 10000, 65000, 999998, 999999, 15000, -1, -1),
weight = c(100, 20, 400, 300, 50, 50, 80, 250, 100, 100))
## just list the names in a vector
loop_over <- c('age' ,'employment', 'income')
## initialize the object you want the loop to fill
final <- list()
for (i in seq_along(loop_over)) {
## !!sym() converts the column name to non-standard evaluation
temp <- df %>%
filter( !!sym(loop_over[i]) >= 1 & !!sym(loop_over[i]) <= max(!!sym(loop_over[i])) - 2 )
avg <- fre( temp[[ loop_over[i] ]], weight = temp$weight )
final[i] <- list(avg)
}

expss table with row percentage within nested variables in R

When using the expss package in R for creating tables, how does one get the row_percentages to be calculated within a nested variable? In the example below, I would like the row percentage to be calculated within each time period. Thus, I would like the row percentage to sum to 100% within each time period (2015-2016 and 2017-2018). Now however, the percentage is calculated over the entire row.
library(expss)
data(mtcars)
mtcars$period <- "2015-2016"
mtcars <- rbind(mtcars, mtcars)
mtcars$period[33:64] <- "2017-2018"
mtcars = apply_labels(mtcars,
cyl = "Number of cylinders",
am = "Transmission",
am = c("Automatic" = 0,
"Manual"=1),
period = "Measurement period"
)
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period %nest% am) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot()
Created on 2019-09-28 by the reprex package (v0.3.0)
| | | | Measurement period | | | |
| | | | 2015-2016 | | 2017-2018 | |
| | | | Transmission | | Transmission | |
| | | | Automatic | Manual | Automatic | Manual |
| ------------------- | ------------ | -------- | ------------------ | ------ | ------------ | ------ |
| Number of cylinders | 4 | row_perc | 13.6 | 36.4 | 13.6 | 36.4 |
| | 6 | row_perc | 28.6 | 21.4 | 28.6 | 21.4 |
| | 8 | row_perc | 42.9 | 7.1 | 42.9 | 7.1 |
| | #Total cases | row_perc | 19.0 | 13.0 | 19.0 | 13.0 |
I believe this is what you are after:
library(expss)
data(mtcars)
mtcars$period <- "2015-2016"
mtcars <- rbind(mtcars, mtcars)
mtcars$period[33:64] <- "2017-2018"
mtcars = apply_labels(mtcars,
cyl = "Number of cylinders",
am = "Transmission",
am = c("Automatic" = 0,
"Manual"=1),
period = "Measurement period"
)
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period %nest% am ) %>%
tab_subgroup(period =="2015-2016") %>%
tab_stat_rpct(label = "row_perc") %>%
tab_subgroup(period =="2017-2018") %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot(stat_position = "inside_rows")
Note two things: tab_subgroup() restricts each tab_stat_rpct() call to one measurement period, so the row percentages are computed within that period, and stat_position = "inside_rows" determines where each calculated statistic is placed in the final table.
Output:
| | | | Measurement period | | | |
| | | | 2015-2016 | | 2017-2018 | |
| | | | Transmission | | Transmission | |
| | | | Automatic | Manual | Automatic | Manual |
| ------------------- | ------------ | -------- | ------------------ | ------ | ------------ | ------ |
| Number of cylinders | 4 | row_perc | 27.3 | 72.7 | | |
| | | | | | 27.3 | 72.7 |
| | 6 | row_perc | 57.1 | 42.9 | | |
| | | | | | 57.1 | 42.9 |
| | 8 | row_perc | 85.7 | 14.3 | | |
| | | | | | 85.7 | 14.3 |
| | #Total cases | row_perc | 19.0 | 13.0 | | |
| | | | | | 19.0 | 13.0 |
EDIT:
We do not need %nest% if we do not want nested rows (i.e. twice as many rows). In this case, the final part of the code should be modified as follows:
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period,am) %>%
tab_subgroup(period ==c("2015-2016")) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_subgroup(period ==c("2017-2018")) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot(stat_position = "outside_columns")
Output:
| | | Measurement period | Transmission | | | | Measurement period |
| | | 2015-2016 | Automatic | Manual | Automatic | Manual | 2017-2018 |
| | | row_perc | row_perc | row_perc | row_perc | row_perc | row_perc |
| ------------------- | ------------ | ------------------ | ------------ | -------- | --------- | -------- | ------------------ |
| Number of cylinders | 4 | 100 | 27.3 | 72.7 | 27.3 | 72.7 | 100 |
| | 6 | 100 | 57.1 | 42.9 | 57.1 | 42.9 | 100 |
| | 8 | 100 | 85.7 | 14.3 | 85.7 | 14.3 | 100 |
| | #Total cases | 32 | 19.0 | 13.0 | 19.0 | 13.0 | 32 |
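The row percentages within each period can be cross-checked in base R. This is a minimal sketch that rebuilds the duplicated data without labels; the names d and row_pct are illustrative.

```r
# Two copies of mtcars, one per measurement period
d <- rbind(
  cbind(mtcars, period = "2015-2016"),
  cbind(mtcars, period = "2017-2018")
)

# For each period: cyl by am table, converted to row percentages
row_pct <- lapply(split(d, d$period), function(p) {
  round(prop.table(table(cyl = p$cyl, am = p$am), margin = 1) * 100, 1)
})

row_pct[["2015-2016"]]
```

Within each period every row sums to 100, which is the behaviour the tab_subgroup() solution reproduces.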

Resources