Pivoting CreateTableOne in R to show levels as column headers?

I'm trying to generate some descriptive summary tables in R using the CreateTableOne function. I have a series of variables that all have the same response options/levels (Yes or No), and want to generate a wide table where the levels are column headings, like this:
Variable Yes No
Var1     1   7
Var2     5   2
But CreateTableOne generates nested long tables, with one column for Level where Yes and No are values in rows, like this:
Variable Level Value
Var1     Yes   1
Var1     No    7
Is there a way to pivot the table to get what I want while still using this function, or is there a different function I should be using instead?
Here is my current code:
vars <- c('var1', 'var2')
Table <- CreateTableOne(vars=vars, data=dataframe, factorVars=vars)
Table_exp <- print(Table, showAllLevels = T, varLabels = T, format="f", test=FALSE, noSpaces = TRUE, printToggle = FALSE)
write.csv(Table_exp, file = "Table.csv")
Thanks!

You could just use pivot_wider to make that table. Here is your data:
library(tidyverse)
dataframe <- data.frame(Variable = c("Var1", "Var1", "Var2", "Var2"),
                        Level = c("Yes", "No", "Yes", "No"),
                        Value = c(1, 7, 5, 2))
Your data:
Variable Level Value
1 Var1 Yes 1
2 Var1 No 7
3 Var2 Yes 5
4 Var2 No 2
You can use this code to make the wider table:
dataframe %>%
  pivot_wider(names_from = "Level", values_from = "Value")
Output:
# A tibble: 2 × 3
Variable Yes No
<chr> <dbl> <dbl>
1 Var1 1 7
2 Var2 5 2

So I got the answer to this question from a coworker, and it's very similar to what Quinten suggested but with some additional steps to account for the structure of my raw data.
The example tables I provided in my question were my desired outputs, not examples of my raw data. The number values weren't values in my dataset, but rather calculated counts of records, and the solution below includes steps for doing that calculation.
This is what my raw data looks like, and it's actually structured wide:
Participant_ID Var1 Var2 Age
1              Yes  No   20
2              No   No   30
We started by creating a subset with just the relevant variables:
subset <- data |> select(Participant_ID, Var1, Var2)
Then we pivoted the data longer first, in order to calculate the counts I wanted in my output table. In this code, we specify that we don't want to pivot Participant_ID, and we create columns called Vars and Response.
subsetlong <- subset |> pivot_longer(-Participant_ID, names_to = "Vars", values_to = "Response")
This is what subsetlong looks like:
Participant_ID Vars Response
1              Var1 Yes
1              Var2 No
2              Var1 No
2              Var2 No
Then we calculated the counts by Vars, putting that into a new dataframe called counts:
counts <- subsetlong |> group_by(Vars) |> count(Response)
And this is what counts looks like:
Vars Response n
Var1 Yes      1
Var1 No       7
Var2 Yes      5
Var2 No       2
Now that the calculation was done, we pivoted this back to wide again, specifying that any NAs should appear as 0s:
counts_wide <- counts |> pivot_wider(names_from="Response", values_from="n", values_fill = 0)
And finally got the desired structure:
Vars Yes No
Var1 1   7
Var2 5   2
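For reference, the whole sequence above can also be written as one chained pipeline. This is a sketch assuming the same data frame and column names as above; the small `data` stand-in here uses illustrative values, not the real dataset:

```r
library(dplyr)
library(tidyr)

# small stand-in for the raw wide data (values are illustrative)
data <- data.frame(Participant_ID = c(1, 2),
                   Var1 = c("Yes", "No"),
                   Var2 = c("No", "No"),
                   Age = c(20, 30))

counts_wide <- data |>
  select(Participant_ID, Var1, Var2) |>
  # long format: one row per participant/variable pair
  pivot_longer(-Participant_ID, names_to = "Vars", values_to = "Response") |>
  # count() combines the group_by() and count steps
  count(Vars, Response) |>
  # back to wide, filling absent level combinations with 0
  pivot_wider(names_from = Response, values_from = n, values_fill = 0)
counts_wide
```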

Related

How to run a custom function in R multiple times using a data frame to get the argument for each run

I wrote a function that outputs all values of a specified categorical variable into an output data frame (I'm more used to using call execute in SAS). It works if I manually write out the function call with the variable name, but not if I try to use mapply. I'm not very experienced (yet) with apply, lapply, mapply, etc., but I have a lot of variables I want to run this on, so I would like to use a data frame to supply the function arguments.
Does anyone have any suggestions? REPREX below:
This works (outputs a table listing all variables and all values associated with each one):
a<-data.frame(var1=c("one","two","three"),var2=c("ants","moths","cows"),var3=c("Sam","Sally","Jugdish"))
b<-data.frame(VNAME=c("var1","var2","var3"))
getvals <- function(varb) {
  temp <- a %>%
    mutate(VNAME = quo_name(enquo(varb))) %>%
    mutate(VALUE = {{varb}}) %>%
    select(c(VNAME, VALUE)) %>%
    distinct()
  Values <- bind_rows(Values, temp)
  Values <- Values %>% filter(VNAME != 'delete' & !is.na(VALUE))
  Values <<- Values
}
Values<-data.frame(VNAME='delete')
getvals(var1)
getvals(var2)
getvals(var3)
But this does not. In fact, it just outputs a table listing the variable names in both columns:
a<-data.frame(var1=c("one","two","three"),var2=c("ants","moths","cows"),var3=c("Sam","Sally","Jugdish"))
b<-data.frame(VNAME=c("var1","var2","var3"))
getvals <- function(varb) {
  temp <- a %>%
    mutate(VNAME = quo_name(enquo(varb))) %>%
    mutate(VALUE = {{varb}}) %>%
    select(c(VNAME, VALUE)) %>%
    distinct()
  Values <- bind_rows(Values, temp)
  Values <- Values %>% filter(VNAME != 'delete' & !is.na(VALUE))
  Values <<- Values
}
Values<-data.frame(VNAME='delete')
mapply(getvals,b$VNAME)
Thank you!
I tried using apply and lapply, but got the same results.
You don't need a loop at all; tidyr::pivot_longer() produces that output directly:
tidyr::pivot_longer(a, everything())
Result
# A tibble: 9 × 2
name value
<chr> <chr>
1 var1 one
2 var2 ants
3 var3 Sam
4 var1 two
5 var2 moths
6 var3 Sally
7 var1 three
8 var2 cows
9 var3 Jugdish
Or if you just want certain variables included (using all_of() to refer to the column names stored in include):
include <- c("var1", "var3")
tidyr::pivot_longer(dplyr::select(a, all_of(include)), everything())
# A tibble: 6 × 2
name value
<chr> <chr>
1 var1 one
2 var3 Sam
3 var1 two
4 var3 Sally
5 var1 three
6 var3 Jugdish
Or as a function for the same output:
extract_vars <- function(df, cols) {
  tidyr::pivot_longer(dplyr::select(df, all_of(cols)), everything())
}
extract_vars(a, include)
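As for why the mapply version fails: getvals() captures its argument with enquo(), so when mapply passes the string "var1", {{varb}} evaluates to that literal string (recycled down the column) rather than to the column itself, which is why both output columns show variable names. If you do want a function that accepts column names as strings, a sketch using dplyr's .data pronoun (this is my suggested variant, not code from the question):

```r
library(dplyr)

a <- data.frame(var1 = c("one", "two", "three"),
                var2 = c("ants", "moths", "cows"),
                var3 = c("Sam", "Sally", "Jugdish"))
b <- data.frame(VNAME = c("var1", "var2", "var3"))

# take the column name as a plain string and look the column up with .data
getvals_chr <- function(varb) {
  a %>%
    transmute(VNAME = varb, VALUE = .data[[varb]]) %>%
    distinct()
}

# lapply over the strings, then bind the pieces; no global assignment needed
Values <- bind_rows(lapply(b$VNAME, getvals_chr))
```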

finding the minimum value of multiple variables by group

I would like to find the minimum value of one variable (time) at which each of several other variables equals 1 (or any other value). Basically, my application is finding the first year in which x == 1, for several x. I know how to find this for one x, but would like to avoid generating multiple reduced data frames of minima and then merging them together. Is there an efficient way to do this? Here are my example data and my solution for one variable.
d <- data.frame(cat = c(rep("A", 10), rep("B", 10)),
                time = c(1:10),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
library(plyr)
ddply(d[d$var1 == 1,], .(cat), summarise,
      start = min(time))
How about this, using dplyr:
d %>%
group_by(cat) %>%
summarise_at(vars(contains("var")), funs(time[which(. == 1)[1]]))
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
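funs() is deprecated in current dplyr; an equivalent sketch with across() (same data, same result, using match() to find the first position where each var equals 1):

```r
library(dplyr)

d <- data.frame(cat = c(rep("A", 10), rep("B", 10)),
                time = c(1:10),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))

first_times <- d %>%
  group_by(cat) %>%
  # for each var column, take the time at the first row where it equals 1
  summarise(across(starts_with("var"), ~ time[match(1, .x)]))
first_times
#   cat  var1  var2
# 1 A       4     5
# 2 B       7     8
```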
We can use base R to get the minimum 'time' among all the columns of 'var' grouped by 'cat'
sapply(split(d[-1], d$cat), function(x)
x$time[min(which(x[-1] ==1, arr.ind = TRUE)[, 1])])
#A B
#4 7
Is this something you are expecting?
library(dplyr)
df <- d %>%
group_by(cat, var1, var2) %>%
summarise(start = min(time)) %>%
filter()
I have left a blank filter argument that you can use to specify any filter condition you want (say, var1 == 1 or cat == "A").

To create a frequency table with dplyr to count the factor levels and missing values and report it

Some questions are similar to this topic (here or here, as an example) and I know one solution that works, but I want a more elegant response.
I work in epidemiology and I have variables 1 and 0 (or NA). Example:
Does patient has cancer?
NA or 0 is no
1 is yes
Let's say I have several variables in my dataset and I want to count only the variables with "1". It's a classical frequency table, but dplyr is making things more complicated than I could have imagined at first glance.
My code is working:
dataset %>%
select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
ComDis, ASD, HealthImpair, DevDelays) %>% # replace to your needs
summarise_all(funs(sum(1-is.na(.))))
And you can reproduce this code here:
library(tidyverse)
dataset <- data.frame(var1 = rep(c(NA,1),100), var2=rep(c(NA,1),100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1-is.na(.))))
But I really want to select all the variables I care about, count how many 0s (or NAs) and how many 1s each one has, and report that as output.
Thanks.
What about the following frequency table per variable?
First, I edit your sample data to also include 0's and load the necessary libraries.
library(tidyr)
library(dplyr)
dataset <- data.frame(var1 = rep(c(NA,1,0),100), var2=rep(c(NA,1,0),100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
select(var1, var2) %>%
gather(var, val) %>%
mutate(val = factor(val)) %>%
group_by(var, val) %>%
count()
# A tibble: 6 x 3
# Groups: var, val [6]
var val n
<chr> <fct> <int>
1 var1 0 100
2 var1 1 100
3 var1 NA 100
4 var2 0 100
5 var2 1 100
6 var2 NA 100
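gather() has since been superseded by pivot_longer(); the same frequency table can be produced with this sketch (same sample data; count() keeps the NA values as their own rows):

```r
library(dplyr)
library(tidyr)

dataset <- data.frame(var1 = rep(c(NA, 1, 0), 100), var2 = rep(c(NA, 1, 0), 100))

freq <- dataset %>%
  pivot_longer(everything(), names_to = "var", values_to = "val") %>%
  # count() groups by var and val, including an NA group per variable
  count(var, val)
freq
#   var   val     n
# 1 var1    0   100
# 2 var1    1   100
# 3 var1   NA   100
# 4 var2    0   100
# 5 var2    1   100
# 6 var2   NA   100
```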
A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 = as.factor(dataset$var1)
dataset$var2 = as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() tells you the number of occurrences of each level of the factor.
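A base-R alternative that skips the factor conversion entirely is table() with useNA (a sketch on the same sample column):

```r
dataset <- data.frame(var1 = rep(c(NA, 1, 0), 100))

# useNA = "ifany" adds an <NA> count column when missing values exist
table(dataset$var1, useNA = "ifany")
#    0    1 <NA>
#  100  100  100
```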

How to sum total of elements in several columns of factor type that are not empty?

Provided a data frame like this one:
df <- data.frame(list(Group = c("Group1", "Group1", "Group2", "Group2"),
A=c("Some text", "Text here too", "Some other text", NA),
B=c(NA, "Some random text", NA, "Random here too")))
> df
Group A B
1 Group1 Some text <NA>
2 Group1 Text here too Some random text
3 Group2 Some other text <NA>
4 Group2 <NA> Random here too
I would like to count all the non-empty values in columns A and B, within each group independently, resulting in the following data frame:
> df.expected
Group A_n B_n
1 Group1 2 1
2 Group2 1 1
Although this is a silly data frame example (the original data frame has far more columns and groups and it's not so easy to manually achieve the results), I am not succeeding due to the fact that I can't operate with factors. Additionally, I'm afraid my approach (see below) is too verbose and maybe overkill, and it makes it not very suitable for my real data frame, with far more columns.
That's what I've done so far:
# Manually create a new numeric column with numbers.
df$A_n = as.character(df$A)
df$A_n[!is.na(df$A_n)] <- 1
df$A_n = as.numeric(df$A_n)
df$B_n = as.character(df$B)
df$B_n[!is.na(df$B_n)] <- 1
df$B_n = as.numeric(df$B_n)
This part is working fine, although I'm afraid there might be a better and shorter/semiautomated way to create new columns and assign them a value. Maybe it's even unnecessary.
The second part of my code is aimed to group the observations according to a grouping variable and sum the values in each variable using dplyr:
library(dplyr)
df2 = df %>%
select(Group, A_n, B_n) %>%
group_by(Group) %>%
summarise_all(sum)
However, I am getting unexpected data frame:
> df2
# A tibble: 2 x 3
Group A_n B_n
<fctr> <dbl> <dbl>
1 Group1 2 NA
2 Group2 NA NA
Can anyone help me in how to tackle this problem in a better way and/or tell me what am I doing wrong with dplyr's code block?
What am I doing wrong with dplyr's code block?
It's because there are NAs. Try
library(dplyr)
df2 = df %>%
select(Group, A_n, B_n) %>%
group_by(Group) %>%
summarise_all(sum, na.rm=TRUE)
instead.
Output on my machine:
# A tibble: 2 x 3
Group A_n B_n
<fctr> <dbl> <dbl>
1 Group1 2 1
2 Group2 1 1
I'm afraid my approach ... is too verbose and maybe overkill
You can just do this:
df <- data.frame(list(Group = c("Group1", "Group1", "Group2", "Group2"),
A=c("Some text", "Text here too", "Some other text", NA),
B=c(NA, "Some random text", NA, "Random here too")))
library(dplyr)
df2 = df %>%
group_by(Group) %>%
summarise_all(.funs=function(x) length(na.omit(x)))
Output on my machine:
# A tibble: 2 x 3
Group A B
<fctr> <int> <int>
1 Group1 2 1
2 Group2 1 1
A little explanation
If you look at help(summarise_all), you'll see its arguments are .tbl, .funs, and ... (we won't worry about the ellipsis for now). So, we feed df into group_by() using the pipe %>%, then feed that into summarise_all(), again using the pipe. That takes care of the .tbl argument. The .funs argument is how you specify what function(s) should be used to summarise all non-grouping columns in .tbl. Here we want to know how many elements of each column are not NA, which we can do (as one approach) by applying length(na.omit(x)) to each non-grouping column x in .tbl.
My best suggestion for a resource to learn about dplyr is Chapter 5 of R for Data Science, a book by Hadley Wickham, who wrote the dplyr package (among many others).
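In dplyr 1.0 and later, summarise_all() is superseded by across(); an equivalent sketch with the same sample data:

```r
library(dplyr)

df <- data.frame(Group = c("Group1", "Group1", "Group2", "Group2"),
                 A = c("Some text", "Text here too", "Some other text", NA),
                 B = c(NA, "Some random text", NA, "Random here too"))

df2 <- df %>%
  group_by(Group) %>%
  # sum(!is.na(.x)) counts the non-missing entries per column per group
  summarise(across(everything(), ~ sum(!is.na(.x))))
df2
#   Group   A  B
# 1 Group1  2  1
# 2 Group2  1  1
```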
In base R, you can use aggregate with the standard interface (as opposed to the formula interface).
aggregate(cbind(A_n=df$A, B_n=df$B), df["Group"], function(x) sum(!is.na(x)))
Group A_n B_n
1 Group1 2 1
2 Group2 1 1
cbind() the variables to be counted and provide their names. In the second argument, include the grouping variable(s). Then, as your function, sum over the indicator of elements that are not missing.

create matrix of z-scores in R

I have a data.frame containing survey data on three binary variables. The data are already in a contingency table, with the first three columns being answers (1 = yes, 0 = no) and the fourth column showing the total number of answers. The rows are four different groups.
My aim is to calculate z-scores to check whether the proportions are significantly different compared to the total.
This is my data:
library(dplyr) #loading libraries
df <- structure(list(var1 = c(416, 1300, 479, 417),
                     var2 = c(265, 925, 473, 279),
                     var3 = c(340, 1013, 344, 284),
                     totalN = c(1366, 4311, 1904, 1233)),
                class = "data.frame",
                row.names = c(NA, -4L),
                .Names = c("var1", "var2", "var3", "totalN"))
And these are my total values:
dfTotal <- df %>% summarise_all(funs(sum(., na.rm=TRUE)))
dfTotal
dfTotal <- data.frame(dfTotal)
rownames(dfTotal) <- "Total"
To calculate the z-score, I use the following function:
zScore <- function(cntA, totA, cntB, totB) {
  avgProportion <- (cntA + cntB) / (totA + totB)
  probA <- cntA / totA
  probB <- cntB / totB
  SE <- sqrt(avgProportion * (1 - avgProportion) * (1/totA + 1/totB))
  zScore <- (probA - probB) / SE
  return(zScore)
}
Is there a way, using dplyr, to calculate a 4x3 matrix that holds, for all four groups and variables var1 to var3, the z-test value against the total proportion?
I am currently stuck with this bit of code:
df %>% mutate_all(funs(zScore(., totalN, dfTotal$var1, dfTotal$totalN)))
The parameters currently passed as dfTotal$var1 and dfTotal$totalN don't work, and I have no idea how to feed them into the formula: the third parameter must not always be var1, but should be var2, var3 (and totalN) to match whichever column is in the first parameter.
The z-score in R is handled with scale():
scale(df)
var1 var2 var3 totalN
[1,] -0.5481814 -0.71592544 -0.4483732 -0.5837722
[2,] 1.4965122 1.42698064 1.4952995 1.4690147
[3,] -0.4024623 -0.04058534 -0.4368209 -0.2087639
[4,] -0.5458684 -0.67046986 -0.6101053 -0.6764787
If you want only the three var columns:
scale(df[,1:3])
var1 var2 var3
[1,] -0.5481814 -0.71592544 -0.4483732
[2,] 1.4965122 1.42698064 1.4952995
[3,] -0.4024623 -0.04058534 -0.4368209
[4,] -0.5458684 -0.67046986 -0.6101053
If you want to use your zScore function inside a dplyr pipeline, we'll need to tidy your data first and add new variables containing the values you now have in dfTotal:
library(dplyr)
library(tidyr)
# add grouping variables we'll need further down
df %>% mutate(group = 1:4) %>%
# reshape data to long format
gather(question,count,-group,-totalN) %>%
# add totals by question to df
group_by(question) %>%
mutate(answers = sum(totalN),
yes = sum(count)) %>%
# calculate z-scores by group against total
group_by(group,question) %>%
summarise(z_score = zScore(count, totalN, yes, answers)) %>%
# spread to wide format
spread(question, z_score)
## A tibble: 4 x 4
# group var1 var2 var3
#* <int> <dbl> <dbl> <dbl>
#1 1 0.6162943 -2.1978303 1.979278
#2 2 0.6125615 -0.7505797 1.311001
#3 3 -3.9106430 2.6607258 -4.232391
#4 4 2.9995381 0.4712734 0.438899
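As a sanity check on that table, the first entry (group 1, var1) can be computed by hand: the pipeline passes group 1's var1 count and total, plus the column sums across all four groups, into zScore. A self-contained sketch (the function body is a condensed copy of the one from the question):

```r
# condensed copy of the zScore function from the question
zScore <- function(cntA, totA, cntB, totB) {
  avgProportion <- (cntA + cntB) / (totA + totB)
  SE <- sqrt(avgProportion * (1 - avgProportion) * (1/totA + 1/totB))
  (cntA/totA - cntB/totB) / SE
}

# group 1, var1: 416 yes of 1366, against the column totals
# (416 + 1300 + 479 + 417 = 2612 yes of 1366 + 4311 + 1904 + 1233 = 8814)
zScore(416, 1366, 2612, 8814)
# ~ 0.6163, matching the var1 entry for group 1 above
```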
