Creating new column based on cluster in R

Dear Stack Overflow users,
I am struggling with R. I have not used it often and usually work in Stata instead.
My data set has several clusters.
What I want to do is create a new column for each cluster, filled with that cluster's values,
so that the clusters become columns and each column holds its values.
Many thanks in advance.

Using dummy data, you can pad each cluster with NA rows so that every cluster has the same number of values, and then pivot to wide format:
library(tidyverse)
df <- data.frame(
value = rnorm(5),
cluster = c(1:4, 4)
)
# Size of the largest cluster
n <- max(table(df$cluster))
# Pad each smaller cluster with NA rows until it has n rows
for (i in unique(df$cluster)) {
  m <- n - nrow(df[df$cluster == i, ])
  if (m > 0) {
    df <- rbind(df, setNames(as.data.frame(matrix(rep(c(NA, i), m), ncol = 2, byrow = TRUE)), names(df)))
  }
}
df %>%
group_by(cluster) %>%
mutate(n = 1:n()) %>%
pivot_wider(names_from = cluster, values_from = value) %>%
select(-n)
`1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl>
1 -0.0549 0.250 0.618 -0.173
2 NA NA NA -2.22
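As a side note, pivot_wider() fills missing combinations with NA on its own, so the padding loop may not be strictly necessary. Here is a minimal sketch under that assumption, with made-up values instead of rnorm() so that the result is reproducible (the helper column name `row` is only for illustration):

```r
library(dplyr)
library(tidyr)

# Made-up values; cluster 4 has two rows, the others have one
df <- data.frame(
  value = c(0.5, -1.2, 0.3, 2.0, -0.7),
  cluster = c(1, 2, 3, 4, 4)
)

wide <- df %>%
  group_by(cluster) %>%
  mutate(row = row_number()) %>%   # position of each value within its cluster
  ungroup() %>%
  pivot_wider(names_from = cluster, values_from = value) %>%
  select(-row)                     # drop the helper column

print(wide)
```

Clusters with fewer values than the largest one simply get NA in the remaining rows.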

Related

Using dplyr::summarise with dplyr::across and purrr::map to sum across columns with the same prefix

I have a data frame where I want to sum column values with the same prefix to produce a new column. My current problem is that it's not taking into account my group_by variable and returning identical values. Is part of the problem the .cols variable I'm selecting in the across function?
Sample data
library(dplyr)
library(purrr)
library(glue) # needed for glue() in the attempt below
set.seed(10)
dat <- data.frame(id = rep(1:2, 5),
var1.pre = rnorm(10),
var1.post = rnorm(10),
var2.pre = rnorm(10),
var2.post = rnorm(10)
) %>%
mutate(index = id)
var_names = c("var1", "var2")
What I've tried
sumfunction <- map(
var_names,
~function(.){
sum(dat[glue("{.x}.pre")], dat[glue("{.x}.post")], na.rm = TRUE)
}
) %>%
setNames(var_names)
dat %>%
group_by(id) %>%
summarise(
across(
.cols = index,
.fns = sumfunction,
.names = "{.fn}"
)
) %>%
ungroup
Desired output
For this and similar problems I made the 'dplyover' package (it is not on CRAN). Here we can use dplyover::across2() to loop over two series of columns, first, all columns ending with "pre" and second all columns ending with "post". To get the names correct we can use .names = "{pre}" to get the common prefix of both series of columns.
library(dplyr)
library(dplyover) # https://timteafan.github.io/dplyover/
dat %>%
group_by(id) %>%
summarise(across2(ends_with("pre"),
ends_with("post"),
~ sum(c(.x, .y)),
.names = "{pre}"
)
)
#> # A tibble: 2 × 3
#> id var1 var2
#> <int> <dbl> <dbl>
#> 1 1 -2.32 -5.55
#> 2 2 1.11 -9.54
Created on 2022-12-14 with reprex v2.0.2
Whenever operations across multiple columns get complicated, we could pivot:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(-c(id, index),
names_to = c(".value", "name"),
names_sep = "\\.") %>%
group_by(id) %>%
summarise(var1 = sum(var1), var2=sum(var2))
id var1 var2
<int> <dbl> <dbl>
1 1 -2.32 -5.55
2 2 1.11 -9.54
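For completeness: the attempt in the question returns identical values for every group because the helper indexes `dat` directly, so each call sums the full columns regardless of group_by(). Here is a rough sketch that stays within dplyr and purrr, assuming every prefix in var_names has exactly one ".pre" and one ".post" column (the names `per_prefix` and `result` are only illustrative):

```r
library(dplyr)
library(purrr)

set.seed(10)
dat <- data.frame(
  id = rep(1:2, 5),
  var1.pre = rnorm(10),
  var1.post = rnorm(10),
  var2.pre = rnorm(10),
  var2.post = rnorm(10)
)
var_names <- c("var1", "var2")

# One grouped summary per prefix; .data[[...]] is evaluated within each group
per_prefix <- map(var_names, function(v) {
  dat %>%
    group_by(id) %>%
    summarise(
      !!v := sum(.data[[paste0(v, ".pre")]], .data[[paste0(v, ".post")]], na.rm = TRUE)
    )
})

# Combine the per-prefix results into one table keyed by id
result <- reduce(per_prefix, left_join, by = "id")
print(result)
```

This is more verbose than the pivot approach, but it keeps the list of prefixes explicit.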

Summarize information by group in data table in R

I'm trying to get multiple summary statistics in R grouped by Team. I used the code below, but the output is not what I want.
Please point me in a better direction. Thanks!
library(dplyr)
set.seed(77)
data <- data.frame(Team =sample(c("A","B"),30, replace=TRUE),
gender=sample(c("female","male"),30, replace=TRUE),
Age =sample(c(0:100),30, replace=T))
dat <- data %>%
group_by(Team, gender) %>%
dplyr::summarize_all(list(my_mean = mean,
my_sum = sum,
my_sd = sd)) %>%
as.data.frame()
df <- data %>%
group_by(Team) %>%
summarize(total = n(gender),
mean = mean(Age),
Max_Age = max(Age),
Min_Age = min(Age),
sd = sd(Age),
)
I want to get output like the one in this picture.
You may need to build one data frame with the summary statistics of age per Team (age_summary in the example below) and another with the count of team members per gender and Team (gender_summary below), and then merge them into a single data frame (summary_df).
library(tidyverse)
set.seed(77)
data <- data.frame(
Team = sample(c("A", "B"), 30, replace = TRUE),
gender = sample(c("female", "male"), 30, replace = TRUE),
Age = sample(c(0:100), 30, replace = TRUE)
)
age_summary <- data %>%
group_by(Team) %>%
summarize(
mean = mean(Age),
Max = max(Age),
Min = min(Age),
sd = sd(Age)
) %>%
column_to_rownames("Team") %>%
t() %>%
as_tibble(
rownames = "age_summary"
)
gender_summary <- data %>%
group_by(Team) %>%
count(gender) %>%
ungroup() %>%
pivot_wider(names_from = Team, values_from = n)
summary_df <- full_join(
age_summary,
gender_summary
) %>%
mutate(
"item" = if_else(
is.na(gender),
"Age",
"Sex"
)
) %>%
unite("summary", c(age_summary, gender), na.rm = TRUE, remove = FALSE) %>%
relocate(item, .before = 1) %>%
select(-c(age_summary, gender))
# # A tibble: 6 × 4
# item summary A B
# <chr> <chr> <dbl> <dbl>
# 1 Age mean 45.6 57.8
# 2 Age Max 92 82
# 3 Age Min 5 14
# 4 Age sd 30.1 22.1
# 5 Sex female 8 9
# 6 Sex male 7 6
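As for why the second attempt in the question fails: n() takes no arguments, so n(gender) throws an error. Here is a minimal sketch of the per-Team summary with that one call corrected (the name `team_summary` is only illustrative; this gives one row per Team rather than the transposed layout above):

```r
library(dplyr)

set.seed(77)
data <- data.frame(
  Team = sample(c("A", "B"), 30, replace = TRUE),
  gender = sample(c("female", "male"), 30, replace = TRUE),
  Age = sample(0:100, 30, replace = TRUE)
)

team_summary <- data %>%
  group_by(Team) %>%
  summarise(
    total = n(),          # number of rows in the current group
    mean = mean(Age),
    Max_Age = max(Age),
    Min_Age = min(Age),
    sd = sd(Age)
  )

print(team_summary)
```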

I want to write a custom function with tidyverse verbs/syntax that accepts the grouping parameters of my function as strings

I want to write a function that takes as parameters a data set, a variable to group by, and a value to filter on. I want to write the function in such a way that I can afterwards apply map() to it and pass the grouping variables to map() as a character vector. However, I don't know how to make my custom function rating() accept the grouping variable as a string. This is what I have tried.
data = tibble(a = seq.int(1:10),
g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
na = NA,
stat=c(23,43,53,2,43,18,54,94,43,87))
rating = function(data, by, no){
data %>%
select(a, {{by}}, stat) %>%
group_by({{by}}) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
rating(data = data, by = g2, no = 5) #this works
And this is the way I want to use my function
map(.x = c("g1", "g2"), .f = ~rating(data = data, by = .x, no = 1))
... but i get
Error: Must group by variables found in `.data`.
* Column `.x` is not found.
Since we are passing character elements, it is better to convert them to symbols with rlang::ensym() and unquote them with !!.
library(dplyr)
library(purrr)
rating <- function(data, by, no){
by <- rlang::ensym(by)
data %>%
select(a, !! by, stat) %>%
group_by(!!by) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
Testing:
> map(.x = c("g1", "g2"), .f = ~rating(data = data, by = !!.x, no = 1))
[[1]]
# A tibble: 1 × 4
a g1 stat rank
<int> <chr> <dbl> <dbl>
1 1 blue 23 1
[[2]]
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 1 pink 23 1
It also works with unquoted input
> rating(data, by = g2, no = 5)
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 5 hotpink 43 3
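If the function only ever needs to accept strings, another option is to skip the symbol conversion and use tidyselect's all_of(). Here is a rough sketch of that variant; `rating_chr` is a hypothetical name, and unlike the version above it does not accept unquoted column names:

```r
library(dplyr)
library(purrr)

data <- data.frame(
  a = 1:10,
  g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
  g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
  stat = c(23, 43, 53, 2, 43, 18, 54, 94, 43, 87)
)

# `by` is a character string naming the grouping column
rating_chr <- function(data, by, no) {
  data %>%
    select(a, all_of(by), stat) %>%
    group_by(across(all_of(by))) %>%
    mutate(rank = rank(stat)) %>%
    ungroup() %>%
    filter(a == no)
}

results <- map(c("g1", "g2"), ~ rating_chr(data, by = .x, no = 1))
print(results)
```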

R: How to perform multiple t.Tests when variable pairs contain NAs throughout?

R doesn't perform a t.test when there are too few observations. However, I need to compare two surveys, where one survey has information on all items, whereas in the other it is lacking in some variables. This leads to a t.test comparison of e.g. q1 from NA (group 1) with values (group 2).
Basically, I need a way to run the t-tests anyway and have a missing result (or an error) reported wherever the requirements are not met. I need to perform multiple t-tests at the same time (q1-q4) with the grouping variable group and write the p-values to an output file.
Thanks for your help!
#create data
surveydata <- as.data.frame(replicate(1,sample(1:5,1000,rep=TRUE)))
colnames(surveydata)[1] <- "q1"
surveydata$q2 <- sample(6, size = nrow(surveydata), replace = TRUE)
surveydata$q3 <- sample(6, size = nrow(surveydata), replace = TRUE)
surveydata$q4 <- sample(6, size = nrow(surveydata), replace = TRUE)
surveydata$group <- c(1,2)
#replace all values of "6" with NA
surveydata[surveydata == 6] <- NA
#add NAs to group 1 in q1
surveydata$q1[which(surveydata$q1==1 & surveydata$group==1)] = NA
surveydata$q1[which(surveydata$q1==2 & surveydata$group==1)] = NA
surveydata$q1[which(surveydata$q1==3 & surveydata$group==1)] = NA
surveydata$q1[which(surveydata$q1==4 & surveydata$group==1)] = NA
surveydata$q1[which(surveydata$q1==5 & surveydata$group==1)] = NA
#perform t.test
svy_sel <- c("q1", "q2", "q3", "q4", "group") #vector for selection
temp <- surveydata %>%
dplyr::select(svy_sel) %>%
tidyr::gather(key = variable, value = value, -group) %>%
dplyr::mutate(value = as.numeric(value)) %>%
dplyr::group_by(group, variable) %>%
dplyr::summarise(value = list(value)) %>%
tidyr::spread(group, value) %>% #convert from “long” to “wide” format
dplyr::group_by(variable) %>% #t-test will be applied to each member of this group (ie., each variable).
dplyr::mutate(p_value = t.test(unlist(1), unlist(2))$p.value, na.action = na.exclude)
Here's a base R way to get a tidy data frame of your results:
do.call(rbind, lapply(names(surveydata)[1:4], function(i) {
tryCatch({
test <- t.test(as.formula(paste(i, "~ group")), data = surveydata)
data.frame(question = i,
group1 = test$estimate[1],
group2 = test$estimate[2],
difference = diff(test$estimate),
p.value = test$p.value, row.names = 1)
}, error = function(e) {
data.frame(question = i,
group1 = NA,
group2 = NA,
difference = NA,
p.value = NA, row.names = 1)
})
}))
#> question group1 group2 difference p.value
#> 1 q1 NA NA NA NA
#> 11 q2 2.893720 3.128878 0.23515847 0.01573623
#> 12 q3 3.020930 3.038278 0.01734728 0.85905665
#> 13 q4 3.024213 3.066998 0.04278444 0.65910949
I'm not going to get into the debate about whether t-tests are appropriate for Likert-type data. I think the consensus is that with decent-sized samples this should be OK.
You could also still do this with dplyr if you wrote a little function that would calculate the test if there was enough data. Here's the function that takes the entries from the dataset and calculates the p-value.
ttfun <- function(v1, v2, ...){
tmp <- data.frame(x = unlist(v1),
y = unlist(v2))
tmp <- na.omit(tmp)
if(nrow(tmp) < 2){
pv <- NA
}
else{
pv <- t.test(tmp$x,tmp$y, ...)$p.value
}
pv
}
Then, you could just call that in your last call to mutate():
svy_sel <- c("q1", "q2", "q3", "q4", "group") #vector for selection
temp <- surveydata %>%
dplyr::select(svy_sel) %>%
tidyr::gather(key = variable, value = value, -group) %>%
dplyr::mutate(value = as.numeric(value)) %>%
dplyr::group_by(group, variable) %>%
dplyr::summarise(value = list(value)) %>%
tidyr::spread(group, value) %>% #convert from “long” to “wide” format
dplyr::group_by(variable) %>% #t-test will be applied to each member of this group (ie., each variable).
dplyr::rename('v1'= '1', 'v2' = '2') %>%
dplyr::mutate(p_value = ttfun(v1, v2))
> temp
# # A tibble: 4 x 4
# # Groups: variable [4]
# variable v1 v2 p_value
# <chr> <list> <list> <dbl>
# 1 q1 <dbl [500]> <dbl [500]> NA
# 2 q2 <dbl [500]> <dbl [500]> 0.724
# 3 q3 <dbl [500]> <dbl [500]> 0.549
# 4 q4 <dbl [500]> <dbl [500]> 0.355
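Neither approach above writes the p-values to a file, which the question also asked for. Here is a small base R sketch of that last step, using a reduced made-up data set; the helper `safe_p`, the threshold of two non-missing values per group, and the temporary file path are assumptions for illustration only:

```r
set.seed(1)
survey <- data.frame(
  q1 = c(rep(NA, 50), sample(1:5, 50, replace = TRUE)),  # group 1 is all NA
  q2 = sample(1:5, 100, replace = TRUE),
  group = rep(1:2, each = 50)
)

# Return the t-test p-value, or NA when a group has too few observations
safe_p <- function(x, g) {
  groups <- split(x, g)
  enough <- all(vapply(groups, function(v) sum(!is.na(v)) >= 2, logical(1)))
  if (!enough) return(NA_real_)
  tryCatch(t.test(groups[[1]], groups[[2]])$p.value,
           error = function(e) NA_real_)
}

questions <- c("q1", "q2")
p_values <- data.frame(
  question = questions,
  p.value = vapply(questions, function(q) safe_p(survey[[q]], survey$group), numeric(1))
)

# Write the results to a CSV file (a temporary file here)
out_file <- tempfile(fileext = ".csv")
write.csv(p_values, out_file, row.names = FALSE)
```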

R package "infer" - Iterative bootstrapping / looping over column names

I'm bootstrapping with the infer package.
The statistic of interest is the mean. The example data is a tibble with 3 columns and 5 rows; my real tibble has 86 rows and 40 columns. For every column I want to do a bootstrap simulation, as shown below for the column "x" in the tibble "test_tibble".
library(infer)
library(tidyverse)
test_tibble <- tibble(x = 1:5, y = 6:10, z = 11:15)
# A tibble: 5 x 3
x y z
<int> <int> <int>
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
specify(test_tibble, response = x) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
# A tibble: 1 x 2
lower_CI upper_CI
<dbl> <dbl>
1 2.10 4
I am now looking for a way of doing the same thing for the other columns in my tibble. I have tried a for-loop like this:
for (i in 1:ncol(test_tibble)){
var_name <- names(test_tibble)[i]
specify(test_tibble, response = var_name) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
}
Unfortunately, this returns the following error:
Error: The response variable `var_name` cannot be found in this dataframe.
Is there any way of iterating over the columns x, y and z without entering them manually as arguments for "response"? That'd be quite tedious for 40 columns.
This is a tricky question with a tricky answer.
Take a look at the description of the response argument of the specify function in the documentation:
The variable name in x that will serve as the response. This is alternative to using the formula argument.
With this in mind, I modified the code to automate the process: I added one more column to the original data frame and used the formula argument instead, with a column of ones as the explanatory variable, to obtain the same result.
library(infer)
library(tidyverse)
test_tibble <- tibble(x = 1:5, y = 6:10, z = 11:15, w = seq(1, 1, length.out = 5))
# Loop over the response columns only; w is just the dummy explanatory variable
for (var_name in setdiff(names(test_tibble), "w")) {
  specify(test_tibble, formula = eval(parse(text = paste0(var_name, "~", "w"))))[, 1] %>%
    generate(reps = 100, type = "bootstrap") %>%
    calculate(stat = "mean") %>%
    summarise(
      lower_CI = quantile(probs = 0.025, stat),
      upper_CI = quantile(probs = 0.975, stat)
    ) %>%
    print() # results are not auto-printed inside a for loop
}
Hope it helps
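If depending on infer is not a requirement, the same percentile bootstrap of the mean can also be sketched in base R, which avoids the question of how to pass column names altogether. This is a swapped-in technique rather than the infer workflow; the helper `boot_ci` and the choice of 1000 replicates are assumptions for illustration:

```r
set.seed(42)
test_tibble <- data.frame(x = 1:5, y = 6:10, z = 11:15)

# Percentile bootstrap interval for the mean of one numeric vector
boot_ci <- function(v, reps = 1000) {
  stats <- replicate(reps, mean(sample(v, replace = TRUE)))
  quantile(stats, probs = c(0.025, 0.975), names = FALSE)
}

# Apply to every column; the result has one row per column
ci <- t(vapply(test_tibble, boot_ci, numeric(2)))
colnames(ci) <- c("lower_CI", "upper_CI")
print(ci)
```

The row names of `ci` are the column names of the input, so this scales to 40 columns without listing them by hand.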
