Create a matrix of z-scores in R

I have a data.frame containing survey data on three binary variables. The data is already a contingency table: the first three columns hold the answer counts for each variable (answers are coded 1 = yes, 0 = no) and the fourth column holds the total number of answers. The rows are four different groups.
My aim is to calculate z-scores to check whether each group's proportions differ significantly from the overall proportion.
This is my data:
library(dplyr) #loading libraries
df <- structure(list(var1 = c(416, 1300, 479, 417),
var2 = c(265, 925,473, 279),
var3 = c(340, 1013, 344, 284),
totalN = c(1366, 4311,1904, 1233)),
class = "data.frame",
row.names = c(NA, -4L),
.Names = c("var1","var2", "var3", "totalN"))
And these are my total values:
dfTotal <- df %>% summarise_all(funs(sum(., na.rm=TRUE)))
dfTotal
dfTotal <- data.frame(dfTotal)
rownames(dfTotal) <- "Total"
To calculate the z-score I use the following function:
zScore <- function (cntA, totA, cntB, totB) {
#calculate
avgProportion <- (cntA + cntB) / (totA + totB)
probA <- cntA/totA
probB <- cntB/totB
SE <- sqrt(avgProportion * (1-avgProportion)*(1/totA + 1/totB))
zScore <- (probA-probB) / SE
return (zScore)
}
Is there a way, using dplyr, to calculate a 4x3 matrix that holds, for all four groups and the variables var1 to var3, the z-test value against the total proportion?
I am currently stuck with this bit of code:
df %>% mutate_all(funs(zScore(., totalN,dftotal$var1,dfTotal$totalN)))
The arguments dftotal$var1 and dfTotal$totalN used here don't work, and I have no idea how to feed the totals into the function. The third argument should not always be var1; it should be var2, var3 (and totalN) so that it matches the column passed as the first argument.

In R, z-scores in the sense of standardized values are computed column by column with scale:
scale(df)
var1 var2 var3 totalN
[1,] -0.5481814 -0.71592544 -0.4483732 -0.5837722
[2,] 1.4965122 1.42698064 1.4952995 1.4690147
[3,] -0.4024623 -0.04058534 -0.4368209 -0.2087639
[4,] -0.5458684 -0.67046986 -0.6101053 -0.6764787
If you want only the three var columns:
scale(df[,1:3])
var1 var2 var3
[1,] -0.5481814 -0.71592544 -0.4483732
[2,] 1.4965122 1.42698064 1.4952995
[3,] -0.4024623 -0.04058534 -0.4368209
[4,] -0.5458684 -0.67046986 -0.6101053
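For reference, scale() standardizes each column separately: it subtracts the column mean and divides by the column standard deviation. A minimal check of this for var1, using the df from the question (the name manual is only for illustration):

```r
df <- data.frame(var1 = c(416, 1300, 479, 417),
                 var2 = c(265, 925, 473, 279),
                 var3 = c(340, 1013, 344, 284),
                 totalN = c(1366, 4311, 1904, 1233))

# Column-wise standardization by hand: (x - mean) / sd
manual <- (df$var1 - mean(df$var1)) / sd(df$var1)

# TRUE if the hand-computed values match the var1 column returned by scale()
all.equal(manual, as.numeric(scale(df)[, "var1"]))
```

Note that this is a different quantity from the two-proportion z-score defined by zScore in the question, which compares each group's proportion with the total proportion.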

If you want to use your zScore function inside a dplyr pipeline, we'll need to tidy your data first and add new variables containing the values you now have in dfTotal:
library(dplyr)
library(tidyr)
# add grouping variables we'll need further down
df %>% mutate(group = 1:4) %>%
# reshape data to long format
gather(question,count,-group,-totalN) %>%
# add totals by question to df
group_by(question) %>%
mutate(answers = sum(totalN),
yes = sum(count)) %>%
# calculate z-scores by group against total
group_by(group,question) %>%
summarise(z_score = zScore(count, totalN, yes, answers)) %>%
# spread to wide format
spread(question, z_score)
## A tibble: 4 x 4
# group var1 var2 var3
#* <int> <dbl> <dbl> <dbl>
#1 1 0.6162943 -2.1978303 1.979278
#2 2 0.6125615 -0.7505797 1.311001
#3 3 -3.9106430 2.6607258 -4.232391
#4 4 2.9995381 0.4712734 0.438899
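If the data should stay in wide format, a more compact variant is possible. This is only a sketch, assuming dplyr >= 1.0.0 (for across()); the name z_mat is made up for illustration. For each of var1 to var3 it passes the group count and group total together with the column sum and the grand total to zScore:

```r
library(dplyr)

df <- data.frame(var1 = c(416, 1300, 479, 417),
                 var2 = c(265, 925, 473, 279),
                 var3 = c(340, 1013, 344, 284),
                 totalN = c(1366, 4311, 1904, 1233))

# zScore as defined in the question
zScore <- function(cntA, totA, cntB, totB) {
  avgProportion <- (cntA + cntB) / (totA + totB)
  probA <- cntA / totA
  probB <- cntB / totB
  SE <- sqrt(avgProportion * (1 - avgProportion) * (1 / totA + 1 / totB))
  (probA - probB) / SE
}

z_mat <- df %>%
  # per column: group count vs. group total, column sum vs. grand total
  mutate(across(var1:var3, ~ zScore(.x, totalN, sum(.x), sum(totalN)))) %>%
  select(var1:var3) %>%
  as.matrix()

z_mat  # 4 x 3 matrix of z-scores, one row per group
```

Because across() only touches var1 to var3, totalN still holds the original group totals when each column is processed.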

Related

How to make column names the first row in R?

I have seen many tutorials on making the first row the column names, but nothing explaining how to do the reverse. I would like to have my column names as the first row values and change the column names to something like var1, var2, var3. Can this be done?
I plan to row bind a bunch of data frames later, so they all need the same column names.
q = structure(list(shootings_per100k = c(8105.47466098618, 6925.42653239307
), lawtotal = c(3.00137104283906, 0.903522788541896), felony = c(0.787418655097614,
0.409578330883717)), row.names = c("mean", "sd"), class = "data.frame")
Have:
shootings_per100k lawtotal felony
mean 8105.475 3.0013710 0.7874187
sd 6925.427 0.9035228 0.4095783
Want:
var1 var2 var3
var shootings_per100k lawtotal felony
mean 8105.475 3.0013710 0.7874187
sd 6925.427 0.9035228 0.4095783
edit: I just realized that since I plan to row bind several data frames later, it may be best for them to all have the same column names. I changed the 'want' section to reflect the desired outcome.
q <- rbind(colnames(q), round(q, 4))
colnames(q) <- paste0("var", seq_len(ncol(q)))
rownames(q)[1] <- "var"
q
var1 var2 var3
var shootings_per100k lawtotal felony
mean 8105.4747 3.0014 0.7874
sd 6925.4265 0.9035 0.4096
You can use the names() function to extract the column names of the data frame and row bind them to your data frame. Then use names() again to overwrite the existing names with any standard values you want.
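Since the goal is to row bind several data frames afterwards, the names() approach could be wrapped in a small helper. This is a sketch only: names_to_first_row is a hypothetical helper name, and q2 is a made-up second data frame used purely for illustration.

```r
q <- structure(list(shootings_per100k = c(8105.47466098618, 6925.42653239307),
                    lawtotal = c(3.00137104283906, 0.903522788541896),
                    felony = c(0.787418655097614, 0.409578330883717)),
               row.names = c("mean", "sd"), class = "data.frame")

# Made-up second data frame with different column names (illustration only)
q2 <- data.frame(a = c(1, 2), b = c(3, 4), c = c(5, 6),
                 row.names = c("mean", "sd"))

# Hypothetical helper: move the column names into the first row
names_to_first_row <- function(d) {
  out <- rbind(names(d), d)                    # all columns become character
  names(out) <- paste0("var", seq_along(out))  # standard names: var1, var2, ...
  rownames(out)[1] <- "var"
  out
}

# Every data frame now has the same column names, so they can be row bound
combined <- do.call(rbind, lapply(list(q, q2), names_to_first_row))
combined
```

Keep in mind that all columns end up as character, because each one now mixes the old column name with the numeric values.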
You can also do as follows.
library(tidyverse)
# q is the data frame from the question
names <- enframe(colnames(q)) %>%
pivot_wider(id_cols = -name, names_from = value) %>%
rename_with( ~ LETTERS[1:length(q)])
# convert every column to character so it can be bound to the header row
data <- as_tibble(q) %>%
mutate(across(everything(), ~ as.character(.))) %>%
rename_with(~ LETTERS[1:length(q)])
bind_rows(names, data)
# A tibble: 3 × 3
# A B C
# <chr> <chr> <chr>
# 1 shootings_per100k lawtotal felony
# 2 8105.475 3.001371 0.7874187
# 3 6925.427 0.9035228 0.4095783

Pivoting CreateTableOne in R to show levels as column headers?

I'm trying to generate some descriptive summary tables in R using the CreateTableOne function. I have a series of variables that all have the same response options/levels (Yes or No), and want to generate a wide table where the levels are column headings, like this:
Variable  Yes  No
Var1        1   7
Var2        5   2
But CreateTableOne generates nested long tables, with one column for Level where Yes and No are values in rows, like this:
Variable  Level  Value
Var1      Yes        1
Var1      No         7
Is there a way to pivot the table to get what I want while still using this function, or is there a different function I should be using instead?
Here is my current code:
vars <- c('var1', 'var2')
Table <- CreateTableOne(vars=vars, data=dataframe, factorVars=vars)
Table_exp <- print(Table, showAllLevels = T, varLabels = T, format="f", test=FALSE, noSpaces = TRUE, printToggle = FALSE)
write.csv(Table_exp, file = "Table.csv")
Thanks!
pivot_wider alone is enough to make that table. Here is your data:
library(tidyverse)
dataframe = data.frame(Variable = c("Var1", "Var1", "Var2", "Var2"),
Level = c("Yes", "No", "Yes", "No"),
Value = c(1, 7, 5, 2))
Your data:
Variable Level Value
1 Var1 Yes 1
2 Var1 No 7
3 Var2 Yes 5
4 Var2 No 2
You can use this code to make the wider table:
dataframe %>%
pivot_wider(names_from = "Level", values_from = "Value")
Output:
# A tibble: 2 × 3
Variable Yes No
<chr> <dbl> <dbl>
1 Var1 1 7
2 Var2 5 2
So I got the answer to this question from a coworker, and it's very similar to what Quinten suggested but with some additional steps to account for the structure of my raw data.
The example tables I provided in my question were my desired outputs, not examples of my raw data. The number values weren't values in my dataset, but rather calculated counts of records, and the solution below includes steps for doing that calculation.
This is what my raw data looks like, and it's actually structured wide:
Participant_ID  Var1  Var2  Age
1               Yes   No     20
2               No    No     30
We started by creating a subset with just the relevant variables:
subset <- data |> select(Participant_ID, Var1, Var2)
Then we pivoted the data longer first, in order to calculate the counts I wanted in my output table. In this code, we specify that we don't want to pivot Participant_ID, and we create columns called Vars and Response.
subsetlong <- subset |> pivot_longer(-c("Participant_ID"), names_to = "Vars", values_to="Response")
This is what subsetlong looks like:
Participant_ID  Vars  Response
1               Var1  Yes
1               Var2  No
2               Var1  No
2               Var2  No
Then we calculated the counts by Vars, putting that into a new dataframe called counts:
counts <- subsetlong |> group_by(Vars) |> count(Response)
And this is what counts looks like:
Vars  Response  n
Var1  Yes       1
Var1  No        7
Var2  Yes       5
Var2  No        2
Now that the calculation was done, we pivoted this back to wide again, specifying that any NAs should appear as 0s:
counts_wide <- counts |> pivot_wider(names_from="Response", values_from="n", values_fill = 0)
And finally got the desired structure:
Vars  Yes  No
Var1    1   7
Var2    5   2
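Put together, the steps above can be sketched as one pipeline. The toy data frame raw below only mirrors the two participants shown in the raw-data example, so its counts differ from the desired output table; the name raw is an assumption for illustration.

```r
library(dplyr)
library(tidyr)

# Toy raw data mirroring the two example participants (illustration only)
raw <- data.frame(Participant_ID = c(1, 2),
                  Var1 = c("Yes", "No"),
                  Var2 = c("No", "No"),
                  Age = c(20, 30))

counts_wide <- raw |>
  select(Participant_ID, Var1, Var2) |>
  # long format: one row per participant and variable
  pivot_longer(-Participant_ID, names_to = "Vars", values_to = "Response") |>
  # number of records per variable and response
  count(Vars, Response) |>
  # back to wide, with missing combinations filled with 0
  pivot_wider(names_from = Response, values_from = n, values_fill = 0)

counts_wide
```

Here count(Vars, Response) does the grouping and counting in one call. Var2 has no "Yes" answers in the toy data, which is the case values_fill = 0 takes care of.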

R loop/automation calculating statistic comparing two specific means (groups within categories) repeatedly with specific pair combinations

I have a dataframe with three variables: Var1 (with values of A, B, and C), Var2 (with values of X and Y), and Metric (various numeric values). For every group in Var1 there are multiple rows for each Var2 value (they are distinguished by other variables, which are not relevant here, so I have collapsed them).
For every Var1 group I would like to compare the means of Metric between Var2 groups (X mean vs. Y mean) using a statistical test to determine if they are significantly different from each other (the exact test is not important but ideally the solution is agnostic). Doing this in isolation is easy and I have done so with filtering and TukeyHSD and selected my pairs of interest. With the below example I have filtered for 'A' in Var1 and get the p-value for X vs Y:
#Generating dataframe
set.seed(3)
df = data.frame(Var1 = sample(c("A","B","C"), 15, replace=TRUE),
Var2 = sample(c("X","Y"),15, replace=TRUE),
Metric = 1:15)
df
#Calculating significance between X and Y for A in Var1
stat = aov(Metric ~ Var2, data = subset(df, Var1 %in% c("A")))
summary(stat)
TukeyHSD(stat)
Ideally I would like to automate this with some form of loop, so that I only have to input the Var1 groups of interest (my actual dataset has many more than 3 groups) and get the p-value for X vs Y for each of them (A, B, and C here) in an easily accessible format, such as a list of outputs in R.
If I understood correctly:
#Generating dataframe
set.seed(3)
df = data.frame(Var1 = sample(c("A","B","C"), 100, replace=TRUE),
Var2 = sample(c("X","Y"),100, replace=TRUE),
Metric = 1:100)
library(tidyverse)
df %>%
group_nest(Var1) %>%
mutate(AOV = map(data, ~aov(Metric ~ Var2, data = .x))) %>%
transmute(Tukey_HSD = map(AOV, TukeyHSD) %>% map(broom::tidy)) %>%
unnest(Tukey_HSD)
#> # A tibble: 3 x 7
#> term contrast null.value estimate conf.low conf.high adj.p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Var2 Y-X 0 0.931 -20.6 22.4 0.930
#> 2 Var2 Y-X 0 -8.06 -27.3 11.2 0.400
#> 3 Var2 Y-X 0 8.90 -14.1 31.9 0.435
Created on 2021-11-04 by the reprex package (v2.0.1)
With variable filtering:
filter_vars <- c("B", "C")
df %>%
filter(Var1 %in% filter_vars) %>%
group_nest(Var1) %>%
mutate(AOV = map(data, ~aov(Metric ~ Var2, data = .x))) %>%
transmute(Tukey_HSD = map(AOV, TukeyHSD) %>% map(broom::tidy)) %>%
unnest(Tukey_HSD)
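Since the question asks for a solution that does not depend on the exact test, a base R alternative might look like the sketch below. The function name p_by_group and its arguments are hypothetical; it assumes the test function accepts a formula and a data argument and returns an object with a p.value element, as t.test and wilcox.test do.

```r
set.seed(3)
df <- data.frame(Var1 = sample(c("A", "B", "C"), 100, replace = TRUE),
                 Var2 = sample(c("X", "Y"), 100, replace = TRUE),
                 Metric = 1:100)

# Hypothetical helper: one p-value (X vs Y) per requested Var1 group
p_by_group <- function(data, groups, test = t.test) {
  sapply(groups, function(g) {
    test(Metric ~ Var2, data = data[data$Var1 == g, ])$p.value
  })
}

p_t <- p_by_group(df, c("A", "B", "C"))                      # Welch t-test
p_w <- p_by_group(df, c("B", "C"), test = wilcox.test)       # rank-based test
p_t
```

The result is a named numeric vector, so p_t["A"] gives the p-value for group A. By default t.test runs a Welch test, so its p-values will generally differ a little from the TukeyHSD output above.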

Vectorising linear interpolation function for use with mutate

I have a data frame that looks like this:
library(dplyr)
# Set RNG
set.seed(33550336)
# Create toy data frame
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df1 <- df %>% mutate(region = "Here")
df2 <- df %>% mutate(region = "There")
df3 <- df %>% mutate(region = "Everywhere")
df_ref <- do.call(rbind, list(df1, df2, df3))
df_ref$value <- runif(nrow(df_ref))
# > head(df_ref)
# day dist region value
# 1 1 0 Here 0.39413117
# 2 2 0 Here 0.44224203
# 3 3 0 Here 0.44207487
# 4 4 0 Here 0.08007335
# 5 5 0 Here 0.02836093
# 6 6 0 Here 0.94475814
This represents a reference data frame and I'd like to compare observations against it. My observations are taken on a specific day that is found in this reference data frame (i.e., day is an integer from 1 to 10) in a region that is also found in this data frame (i.e., Here, There, or Everywhere), but the distance (dist) is not necessarily an integer between 0 and 100. For example, my observation data frame (df_obs) might look like this:
# Observations
df_obs <- data.frame(day = sample(1:10, 3, replace = TRUE),
region = sample(c("Here", "There", "Everywhere")),
dist = runif(3, 0, 100))
# day region dist
# 1 6 Everywhere 68.77991
# 2 7 There 57.78280
# 3 10 Here 85.71628
Since dist does not necessarily match one of the reference distances, I can't just look up the value corresponding to my observations in df_ref like this:
df_ref %>% filter(day == 6, region == "Everywhere", dist == 68.77991)
So, I created a lookup function that uses the linear interpolation function approx:
lookup <- function(re, di, da){
# Filter to day and region
df_tmp <- df_ref %>% filter(region == re, day == da)
# Approximate answer from distance
approx(unlist(df_tmp$dist), unlist(df_tmp$value), xout = di)$y
}
Applying this to my first observation gives,
lookup("Everywhere", 68.77991, 6)
#[1] 0.8037013
Nevertheless, when I apply the function using mutate I get a different answer.
df_obs %>% mutate(ref = lookup(region, dist, day))
# day region dist ref
# 1 6 Everywhere 68.77991 0.1881132
# 2 7 There 57.78280 0.1755198
# 3 10 Here 85.71628 0.1730285
I suspect that this is because lookup is not vectorised correctly. Why am I getting different answers and how do I fix my lookup function to avoid this?
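A likely explanation, offered as a sketch rather than a confirmed answer from the thread: inside mutate, lookup receives the whole region, dist and day columns at once. The comparison filter(region == re, day == da) then recycles those length-3 vectors along the rows of df_ref instead of selecting a single region and day, and approx interpolates all three distances over that mixed subset. One way around this is to call lookup once per row, for example with mapply. The observation values below are copied from the printed df_obs above.

```r
library(dplyr)

set.seed(33550336)
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df_ref <- do.call(rbind, list(mutate(df, region = "Here"),
                              mutate(df, region = "There"),
                              mutate(df, region = "Everywhere")))
df_ref$value <- runif(nrow(df_ref))

# Observations copied from the example output in the question
df_obs <- data.frame(day = c(6, 7, 10),
                     region = c("Everywhere", "There", "Here"),
                     dist = c(68.77991, 57.78280, 85.71628))

# Scalar lookup: expects ONE region, ONE distance and ONE day
lookup <- function(re, di, da) {
  df_tmp <- df_ref %>% filter(region == re, day == da)
  approx(df_tmp$dist, df_tmp$value, xout = di)$y
}

# Call lookup once per row instead of once with whole columns
res <- df_obs %>%
  mutate(ref = mapply(lookup, region, dist, day, USE.NAMES = FALSE))
res
```

Wrapping the function with Vectorize(lookup) or using rowwise() before mutate should give the same per-row behaviour.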

How to quickly create multiple summary tables with group_by() / summarise()?

I have a data frame with N vars, M categorical and 2 numeric. I would like to create M data frames, one for each categorical variable.
E.g.,
data %>%
group_by(var1) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
data %>%
group_by(varM) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
etc...
Is there a way to iterate through the categorical variables and generate each of the summary tables? That is, without needing to repeat the above chunks M times.
Alternatively, these summary tables don't have to be individual objects, as long as I can easily reference / pull the summaries for each of the M variables.
Here is a solution (I hope). Creates a list of data frames with the formula you have:
library(tidyverse)
# Create sample data frame
data <- data.frame(var1 = sample(1:2, 5, replace = T),
var2 = sample(1:2, 5, replace = T),
var3 = sample(1:2, 5, replace = T),
varM = sample(1:2, 5, replace = T),
var5 = rnorm(5, 3, 6),
var6 = rnorm(5, 3, 6))
# Vars to be grouped (var1 until varM in this example)
vars_to_be_used <- names(select(data, var1:varM))
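# Function to be used; x is the name of the grouping column, as a string
group_fun <- function(x, .df = data) {
.df %>%
# all_of() selects the column named in x (replaces the deprecated group_by_)
group_by(across(all_of(x))) %>%
summarise(sumVar5 = sum(var5),
meanVar6 = mean(var6))
}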
# Loop over vars
results <- map(vars_to_be_used, group_fun)
# Nice list names
names(results) <- vars_to_be_used
print(results)
You didn't supply a sample data set, so I created a small example to show how it works.
data <- tibble::tibble(var1 = rep(letters[1:5], 2),
var2 = rep(LETTERS[11:15], 2),
var3 = 1:10,
var4 = 11:20)
A combination of tidyverse packages can get you where you need to be.
Steps used: First we gather all the columns we want to group by into a cols column and keep the numeric vars separate. Next we split the data.frame into a list of data.frames, so that every column we want to group by has its own table with the 2 numeric vars. Now that everything is in a list, we use the map function from the purrr package. Using map, we spread each data.frame again so the column names are as we expect them to be. Finally, again using map, we use group_by_if to group by the character column and summarise the rest. All the outcomes are stored in a list where you can access what you need.
Run the code in pieces to see what every step does.
library(dplyr)
library(purrr)
library(tidyr)
outcomes <- data %>%
gather(cols, value, -c(var3, var4)) %>%
split(.$cols) %>%
map(~ spread(.x, cols, value)) %>%
map(~ group_by_if(.x, is.character) %>%
summarise(sumvar3 = sum(var3),
meanvar4 = mean(var4)))
outcomes
$`var1`
# A tibble: 5 x 3
var1 sumvar3 meanvar4
<chr> <int> <dbl>
1 a 7 13.5
2 b 9 14.5
3 c 11 15.5
4 d 13 16.5
5 e 15 17.5
$var2
# A tibble: 5 x 3
var2 sumvar3 meanvar4
<chr> <int> <dbl>
1 K 7 13.5
2 L 9 14.5
3 M 11 15.5
4 N 13 16.5
5 O 15 17.5
