Apply t.test on tidy-format data - r

I have a data frame in tidy format as follows:
df <- data.frame(name = c("A", "C", "B", "A", "B", "C", "D") ,
group = c(rep("case", 3), rep("cntrl", 4)),
mean = rnorm(7, 0,1))
I would like to group the data by the two variables name and group and apply a t.test to the mean values of each category, for example a t.test of A_case vs. A_cntrl, and add the p-value to the table as a result.
How can I do this using the tidyverse packages?
Thanks,

Here, a group-wise t.test by 'name' cannot be performed, as there is only a single observation for each name/group pair, and t.test needs at least two observations per sample. Instead, we can run one test across the whole data:
library(dplyr)
df %>%
summarise(ttest = list(t.test(mean[group == 'case'],
mean[group == 'cntrl']))) %>%
pull(ttest)
Update
If we need to create a column, use mutate
df %>%
mutate(pval = t.test(mean[group == 'case'],
mean[group == 'cntrl'])$p.value)
Or reshape to 'wide' format and then do the t.test on the columns
library(tidyr)
df %>%
pivot_wider(names_from = group, values_from = mean) %>%
summarise(ttest = list(t.test(case, cntrl))) %>%
pull(ttest)
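If the data did contain several observations per name/group combination, a per-name test becomes possible. The following is only a sketch under that assumption: df2 and its replicate column rep are made up for illustration and are not part of the question's data.

```r
library(dplyr)

set.seed(1)
# Hypothetical data: 5 observations for every name/group combination
df2 <- expand.grid(name = c("A", "B", "C"),
                   group = c("case", "cntrl"),
                   rep = 1:5,
                   stringsAsFactors = FALSE)
df2$mean <- rnorm(nrow(df2))

# One Welch t-test per name, comparing case against cntrl
pvals <- df2 %>%
  group_by(name) %>%
  summarise(pval = t.test(mean[group == "case"],
                          mean[group == "cntrl"])$p.value)
pvals
```

Swapping summarise for mutate would instead attach the p-value to every row of its name group.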

Related

How do you use forcats' fct_lump_min() function on a factor while keeping another identifying factor?

Let's consider this dummy dataset:
v1<- c("A","B", "C", "D", "E", "F")
v2<- c("Z","Y", "X", "X", "V", "U")
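Count<- c(2, 5, 10, 5, 1, 1) # sixth value is assumed; the original vector had only five values, which makes cbind.data.frame fail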
df<- cbind.data.frame(v1, v2, Count)
I want to use fct_lump_min() to lump all v1 factors that have a count of 2 or less into another factor named "Unique". If I completely disregard the v2 factor column, I have functional code like this:
df<-df %>%
mutate(CombinedDGSequence = fct_lump_min(v1, 2, Count, other_level = "Unique")) %>%
count(CombinedDGSequence, wt = Count, name = "Count")
However, doing so removes the corresponding v2 factor column completely. Is there any way I can maintain each v1 factor level's corresponding v2 value in the resulting dataframe after using fct_lump_min?
Thanks guys!
We may need add_count, which adds a new column instead of summarizing and therefore keeps v2:
library(dplyr)
library(forcats)
df %>%
mutate(CombinedDGSequence = fct_lump_min(v1, 2, Count,
other_level = "Unique")) %>%
add_count(CombinedDGSequence, wt = Count, name = "Count")
You may try this to combine all the v2 values in one string.
library(dplyr)
library(forcats)
df %>%
mutate(v1 = fct_lump_min(v1, 2, Count, other_level = "Unique")) %>%
group_by(v1) %>%
summarise(v2 = toString(v2),
Count = sum(Count))

Creating duplicate in R

I have the following input data frame with 4 columns and 3 rows.
The time column can take values from 1 up to the value of the maturity column for that customer. I want to create additional observations for each customer until time equals maturity, with the other columns retaining their original values. Please see the links below for the input and expected output.
Input
Output
Here is a dplyr solution inspired by, but not identical to, this post.
library(dplyr)
df <- data.frame(custno = 1:3, time = 1, dept = c("A", "B", "A"))
df %>%
slice(rep(1:n(), each = 5)) %>%
group_by(custno) %>%
mutate(time = seq_along(time))
Edit
After the comments by the OP, the following seems to be better.
First, the data:
df <- data.frame(custno = 1:3, time = 1,
dept = c("A", "B", "A"),
maturity = c(5,4,6))
And the solution.
df %>%
tidyr::uncount(maturity, .remove = FALSE) %>% # .remove = FALSE keeps the maturity column
group_by(custno) %>%
mutate(time = seq_along(time))
We can also use slice with row_number, and rowid from data.table to number the rows within each customer:
library(dplyr)
library(data.table)
df %>%
slice(rep(row_number(), maturity)) %>%
mutate(time = rowid(custno))
data
df <- data.frame(custno = 1:3, time = 1,
dept = c("A", "B", "A"),
maturity = c(5,4,6))
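For comparison, the same expansion can be sketched in base R without any packages. This is not taken from either answer; the name expanded is chosen here for illustration only.

```r
df <- data.frame(custno = 1:3, time = 1,
                 dept = c("A", "B", "A"),
                 maturity = c(5, 4, 6))

# Repeat each row index as many times as that row's maturity
expanded <- df[rep(seq_len(nrow(df)), times = df$maturity), ]

# sequence() yields 1..5, 1..4, 1..6, i.e. a counter that restarts per customer
expanded$time <- sequence(df$maturity)

# Drop the duplicated row names left over from the indexing
rownames(expanded) <- NULL
expanded
```

This relies on each customer occupying exactly one row in the input, as in the example data.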

Concatenating Row[-1,-3] in the Tidyverse

I'm new to the Tidyverse and dplyr and was hoping to get some guidance on how best to concatenate data from the rows below the current row. For example, in the dataframe below I want to use data in the Grade column to create the data in the Prior3Grades column. The Prior3Grades data for 1/2/2019 would be created by concatenating the grades from 12/3/18, 11/3/18 and 10/4/18.
Can this be achieved in dplyr using mutate or some other means? Also, is this in dplyr's wheelhouse, or would it be better suited to SQL?
Using some basic packages from the tidyverse:
library(dplyr)
library(tidyr)
library(tibble)
df <- tibble(
Name = "Bob",
TestDate = seq(as.Date("2019-02-01"), as.Date("2019-05-08"), length.out = 6), ## some random dates
Grade = c("A", "A", "B", "C", "D", "A")
)
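df %>%
# sort newest first, as in the question, so that lead() picks up earlier tests
arrange(desc(TestDate)) %>%
group_by(Name) %>%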
mutate(
grade1 = lead(Grade),
grade2 = lead(Grade, 2),
grade3 = lead(Grade, 3)
) %>%
replace_na(list(grade1 = "", grade2 = "", grade3 = "")) %>%
mutate(
Prior3Grades = paste0(grade1, grade2, grade3)
)
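If more than three prior grades are ever needed, the separate lead() columns can be replaced by a small helper. The following is only a sketch: prior_n is a hypothetical helper name, not a dplyr function, the grades are made up, and the rows are assumed to be sorted newest first within each Name, as in the question.

```r
library(dplyr)

# Hypothetical helper: for each element, paste together the next n elements
prior_n <- function(x, n = 3) {
  pos <- seq_along(x)
  vapply(pos, function(i) {
    paste0(x[pos > i & pos <= i + n], collapse = "")
  }, character(1))
}

# Assumed example data, sorted newest first (dates from the question, grades invented)
grades <- data.frame(
  Name = "Bob",
  TestDate = as.Date(c("2019-01-02", "2018-12-03", "2018-11-03", "2018-10-04")),
  Grade = c("A", "B", "C", "D")
)

result <- grades %>%
  group_by(Name) %>%
  mutate(Prior3Grades = prior_n(Grade, 3)) %>%
  ungroup()
result
```

Rows with fewer than n earlier tests simply get a shorter (or empty) string, which matches what replace_na achieves in the lead() version.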

Calculation on every pair from grouped data.frame

My question is about performing a calculation between each pair of groups in a data.frame. I'd like it to be more vectorized.
I have a data.frame that consists of the following columns: Location, Sample, Var1, and Var2. I'd like to find the closest match for each Sample for each pair of Locations, for both Var1 and Var2.
I can accomplish this for one pair of locations as such:
df0 <- data.frame(Location = rep(c("A", "B", "C"), each =30),
Sample = rep(c(1:30), times =3),
Var1 = sample(1:25, 90, replace =T),
Var2 = sample(1:25, 90, replace=T))
df00 <- data.frame(Location = rep(c("A", "B", "C"), each =30),
Sample = rep(c(31:60), times =3),
Var1 = sample(1:100, 90, replace =T),
Var2 = sample(1:100, 90, replace=T))
df000 <- rbind(df0, df00)
df <- sample_n(df000, 100) # data
dfl <- df %>% gather(VAR, value, 3:4)
df1 <- dfl %>% filter(Location == "A")
df2 <- dfl %>% filter(Location == "B")
df3 <- merge(df1, df2, by = c("VAR"), all.x = TRUE, allow.cartesian=TRUE)
df3 <- df3 %>% mutate(DIFF = abs(value.x-value.y))
result <- df3 %>% group_by(VAR, Sample.x) %>% top_n(-1, DIFF)
I tried other possibilities such as using tidyr::spread but could not avoid the "Error: Duplicate identifiers for rows" or columns half filled with NA.
Is there a cleaner and more automated way to do this for each possible group pair? I'd like to avoid the manual subset and merge routine for each pair.
One option would be to create the pairwise combination of 'Location' with combn and then do the other steps as in the OP's code
library(tidyverse)
df %>%
# get the unique elements of Location
distinct(Location) %>%
# pull the column as a vector
pull %>%
# it is factor, so convert it to character
as.character %>%
# get the pairwise combinations in a list
combn(m = 2, simplify = FALSE) %>%
# loop through the list with map and do the full_join
# with the long format data dfl (df1 in the OP's code holds only Location "A")
map(~ full_join(dfl %>%
filter(Location == first(.x)),
dfl %>%
filter(Location == last(.x)), by = "VAR") %>%
# create a column of absolute difference
mutate(DIFF = abs(value.x - value.y)) %>%
# grouped by VAR, Sample.x
group_by(VAR, Sample.x) %>%
# apply the top_n with wt as DIFF
top_n(-1, DIFF))
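To make the combn step concrete, here is a minimal base R sketch of what it returns for three locations. The names locations, pairs and pair_names are chosen here for illustration.

```r
locations <- c("A", "B", "C")

# Every unordered pair of locations, returned as a list of length-2 vectors
pairs <- combn(locations, m = 2, simplify = FALSE)

# Readable labels such as "A_vs_B", one per pair
pair_names <- vapply(pairs, paste, character(1), collapse = "_vs_")

# Naming the list means the output of map() carries the same labels
names(pairs) <- pair_names
pairs
```

With k locations this yields k * (k - 1) / 2 pairs, so the number of joins grows quadratically with the number of locations.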
Alternatively, since the OP asked for something more automatic than filtering twice, both Locations can be filtered in one step (the expected output is not entirely clear, though):
df %>%
distinct(Location) %>%
pull %>%
as.character %>%
combn(m = 2, simplify = FALSE) %>%
map(~ dfl %>%
# change here i.e. filter both the Locations
filter(Location %in% .x) %>%
# spread it to wide format
spread(Location, value, fill = 0) %>%
# create the DIFF column by taking the difference
mutate(DIFF = abs(!! rlang::sym(first(.x)) -
!! rlang::sym(last(.x)))) %>%
group_by(VAR, Sample) %>%
top_n(-1, DIFF))

In nested data frame, pass information from one list column to function applied in another

I am working on a report for which I have to export a large number of similar data frames into nice looking tables in Word. My goal is to achieve this in one go, using flextable to generate the tables and purrr / tidyverse to apply all the formatting procedures to all rows in a nested data frame. This is what my data frame looks like:
df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
students = c(round(runif(6, 1, 10), 0)),
grade = c(1, 1, 2, 2, 3, 3))
I want to generate separate tables for all groups in column 'school' and started by using the nest() function within tidyr.
list <- df %>%
group_by(school) %>%
nest()
This gives me a nested data frame to which I can apply the functions in flextable using purrr:
list <- list %>%
mutate(ftables = map(data, flextable)) %>%
mutate(ftables = purrr::map(ftables, ~ set_header_labels(.,
students = "No of students",
grade = "Grade")))
The first mutate generates a new column with flextable objects for each school, and the second mutate applies header labels to the table, based on the column names that are saved in the object.
My goal is now to add another header that is based on the name of the school. This value resides in the list column entitled school, which corresponds row-wise to the tables generated in the list column ftables. How can I pass the name of the school to the add_header function within ftables, using purrr or any other procedure?
Expected output
I have been able to achieve what I want for individual schools with this procedure (identical header cells will later be merged):
school.name <- "A"
ftable.a <- df %>%
filter(school == "A") %>%
select(-school) %>%
flextable() %>%
set_header_labels(students = "No of students",
grade = "Grade") %>%
add_header(students = school.name,
grade = school.name)
ftable.a
The purrr package provides the function map2, which iterates over two inputs in parallel and is what you should use here:
library(flextable)
library(magrittr)
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
students = c(round(runif(6, 1, 10), 0)),
grade = c(1, 1, 2, 2, 3, 3))
byschool <- df %>%
group_by(school) %>%
nest()
byschool <- byschool %>%
mutate(ftables = map(data, flextable)) %>%
mutate(ftables = purrr::map(
ftables, ~ set_header_labels(.,
students = "No of students",
grade = "Grade"))) %>%
mutate(ftables = purrr::map2(ftables, school, function(ft, h){
add_header(ft, students = h, grade = h)
} ))
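The pairing that map2 performs can be seen without flextable. The sketch below is an illustration only: the student counts are fixed values rather than random ones, and the caption column is a made-up stand-in for the header step.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Assumed example data with fixed student counts
df <- data.frame(school = c("A", "B", "A", "B", "A", "B"),
                 students = c(3, 5, 2, 7, 4, 6),
                 grade = c(1, 1, 2, 2, 3, 3))

labelled <- df %>%
  group_by(school) %>%
  nest() %>%
  ungroup() %>%
  # map2_chr walks school and data in parallel: .x is the school name,
  # .y is that school's nested data frame
  mutate(caption = map2_chr(school, data,
                            ~ paste0("School ", .x, ": ", nrow(.y), " rows")))
labelled$caption
```

The same pattern carries over to the flextable case: the first argument is the list column to modify and the second supplies the row-wise value to pass in.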
