Writing a function to summarize the results of dunn.test::dunn.test in R

In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am really interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library (tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
  Month  Mean
  <int> <dbl>
1     5  23.6
2     6  29.4
3     7  59.1
4     8  60.0
5     9  31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean |          5          6          7          8
---------+--------------------------------------------
       6 |  -0.925158
         |     0.4436
         |
       7 |  -4.419470  -2.244208
         |    0.0001*    0.0496*
         |
       8 |  -4.132813  -2.038635   0.286657
         |    0.0002*     0.0691     0.8604
         |
       9 |  -1.321202   0.002538   3.217199   2.922827
         |     0.2663     0.9980    0.0043*    0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August) and so on. So I'd like to append significantly differing groups to my results table:
# A tibble: 5 x 3
  Month  Mean Group
  <int> <dbl> <chr>
1     5  23.6 a
2     6  29.4 ac
3     7  59.1 b
4     8  60.0 bc
5     9  31.4 a
While I did this by hand, I suppose it must be possible to automate this process. However, I don't find a good starting point. I created a dataframe containing all comparisons:
> ozone_differences <- dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from letters might work. However, I cannot even think of a starting point, because a changing number of rows has to be considered at the same time...
Perhaps someone has a good idea?

Perhaps you could look into the cldList() function from the rcompanion package. You can pass the res element of the dunnTest() output to it and create a table that gives the compact letter display for each group.

Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn test (1964).
In general, I recommend ordering groups by e.g. median or mean before running e.g. dunnTest if you plan on using a cld, so that the cld comes out in a sensible order.
library (tidyverse)
ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res
### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data=Result)
### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
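To connect this back to the data frame built in the question, here is a minimal sketch of the preparation step described above: turning the adjusted p-values and comparison labels into the named logical vector that multcompLetters() expects. The p-values are copied from the question's output; the name sig and the 0.05 cutoff are illustrative choices, and the last step only runs if multcompView happens to be installed.

```r
# Adjusted p-values and comparison labels, copied from the question's output
ozone_differences <- data.frame(
  P = c(4.436043e-01, 9.894296e-05, 4.963804e-02, 1.791748e-04, 6.914403e-02,
        8.604164e-01, 2.663342e-01, 9.979745e-01, 4.314957e-03, 8.671708e-03),
  Compare = c("5 - 6", "5 - 7", "6 - 7", "5 - 8", "6 - 8",
              "7 - 8", "5 - 9", "6 - 9", "7 - 9", "8 - 9")
)

# TRUE means the two months differ at alpha = 0.05 (illustrative cutoff)
sig <- ozone_differences$P <= 0.05

# multcompLetters() needs names of the form "5-6", without spaces
names(sig) <- gsub(" ", "", ozone_differences$Compare)

# Only attempt the letter display when multcompView is available
if (requireNamespace("multcompView", quietly = TRUE)) {
  cld <- multcompView::multcompLetters(sig)
  print(cld$Letters)
}
```

The resulting letters are named by month, so they can be matched to the Month column of the summary table.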

Related

How do I change numeric values in a subset of columns in a R dataframe to other numeric values?

I have a dataset with currently 4 rows /subjects (more to come as this is ongoing research) and 259 variables /columns. 240 variables of this dataset are ratings of fit ("How well does the following adjective match the dimension X?") and 19 variables are sociodemographic.
For these 240 rating variables, my subjects could give a rating ranging from 1 ("fits very badly") to 7 ("fits very well"). Consequently, I have 240 variables with values ranging from 1 to 7. I would like to change these numeric values as follows (the procedure being the same for all of the 240 columns):
1 should change to 0, 2 should change to 1/6, 3 should change to 2/6, 4 should change to 3/6, 5 should change to 4/6, 6 should change to 5/6 and 7 should change to 1. So no matter where in the 240 columns, a 1 should change to 0 and so on.
I have tried the following approaches:
Recode numeric values in R
In this post, it says that
x <- 1:10
# With recode function using backquotes as arguments
dplyr::recode(x, `2` = 20L, `4` = 40L)
# [1] 1 20 3 40 5 6 7 8 9 10
# With case_when function
dplyr::case_when(
x %in% 2 ~ 20,
x %in% 4 ~ 40,
TRUE ~ as.numeric(x)
)
# [1] 1 20 3 40 5 6 7 8 9 10
Consequently, I tried this:
df = ds %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20)
%>% recode(.,`1`=0,`2`=-1/6,`3`=-2/6, `4`=3/6,`5`=4/6, `6`=5/6, `7`=1))
with AD01_01 etc. being the column names for the adjectives my subjects should rate. I also tried it without the ., after recode(, to no avail.
This code is flawed because it omits the 19 columns of sociodemographic data I want to keep in my dataset. Moreover, I get the error unexpected SPECIAL in "%>%".
I thought R might accept my selected columns with the pipe operator as the "x" in recode. Apparently, this is not the case. I also tried to read up on the R documentation of recode but it made things much more confusing for me, as there were a lot of technical terms I don't understand.
As there is another option mentioned in the post, I also tried this:
df = df %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20) %>% case_when (.,%in% 1~0,%in% 2~1/6,%in%3~2/6,%in%4~3/6,%in%5~4/6,%in%6~5/6,%in%7~1)
I thought I could give the output of the select function to the case_when function. Apparently, this is also not the case.
When I execute this command, I get
Error: unexpected SPECIAL in:
"df = df %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20) %>% case_when (%in%"
Reading up on other possibilities, I found this
https://rstudio-education.github.io/hopr/modify.html
exemplary dataset:
head(dplyr::storms)
## # A tibble: 6 x 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## # ... with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
# We decide that we want to recode all NAs to 9999.
storm <- storms
storm$ts_diameter[is.na(storm$ts_diameter)] <- 9999
summary(storm$ts_diameter)
ds$AD01_01:AD01_20[1(ds$AD01_01:AD01_20)] <- 0, ds$AD01_01:AD01_20[2(ds$AD01_01:AD01_20)] <- 1/6, ds$AD01_01:AD01_20[3(ds$AD01_01:AD01_20)] <- 2/6,
ds$AD01_01:AD01_20[4(ds$AD01_01:AD01_20)] <- 3/6, ds$AD01_01:AD01_20[5(ds$AD01_01:AD01_20)] <- 4/6, ds$AD01_01:AD01_20[6(ds$AD01_01:AD01_20)] <- 5/6,
ds$AD01_01:AD01_20[7(ds$AD01_01:AD01_20)] <- 1
My idea in this case was to assign to multiple columns at a time (this attempt covers just 20 of my 240 columns), and it also didn't work. I got the error
could not find function ":<-", which is weird because I thought this was a basic command. The only noteworthy thing that might explain it is that I executed library(readr) and library(tidyverse) beforehand.
Disclaimer: I am an R newbie and have spent 2 hours to try to solve this issue. I would also like to know where I went wrong and why my code doesn't work.
How about using mutate(across())? For example, if all your "adjective rating" columns start with "AD", you can do something like this:
library(dplyr)
ds %>% mutate(across(starts_with("AD"), ~(.x-1)/6))
Explanation of where you went wrong with your code:
First, your select(...) %>% recode(...) was close. However, when you use select, you are reducing ds to only the selected columns, thus recoding those values and assigning to df will result in df not having the demographic variables.
Second, if you want to use recode you can, but you can't feed it an entire data frame/tibble, like you are doing when you pipe (%>%) the selected columns to it. Instead, you can use recode() iteratively in .fns, on each of the columns in the .cols param of across(), like this:
ds %>%
  mutate(across(
    .cols = starts_with("AD"),
    # map each rating 1..7 to 0, 1/6, ..., 1 (all positive, as the question specifies)
    .fns = ~recode(.x, `1`=0, `2`=1/6, `3`=2/6, `4`=3/6, `5`=4/6, `6`=5/6, `7`=1))
  )
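For comparison, the same rescaling can be sketched in base R without dplyr. This is only an illustration on a made-up three-row data frame; the column names and the assumption that every rating column starts with "AD" are hypothetical and would need to match the real dataset.

```r
# Toy data: two rating columns and one demographic column (names are hypothetical)
ds <- data.frame(AD01_01 = c(1, 4, 7), AD01_02 = c(2, 3, 6), age = c(25, 31, 47))

# Pick the rating columns by name prefix, leaving everything else untouched
rating_cols <- grep("^AD", names(ds))

# Rescale 1..7 onto 0..1 in steps of 1/6
ds[rating_cols] <- lapply(ds[rating_cols], function(x) (x - 1) / 6)
ds
```

Because only the selected columns are overwritten, the demographic columns stay in the data frame, which is what the select() attempts in the question lost.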

R: For a list of dfs, filter for a value in Column1, to extract mean and SD of another Column2 (only of rows with the filtered value in Column1)

I have a big dataset ('links_cl', each participant of a study has several 100 rows), which I need to subset into dfs, one for each participant.
For those 42 dfs, I then need to do the same operation again and again. After spending half a day trying to write my own function, trying to find a solution online, I now have to ask here.
So, I am looking for a way to:
1. Subset the huge dataset several times and have one df in my environment for every participant, without using the same code 42 times. What I did so far 'by hand' is:
Subj01 <- subset(links_cl, Subj == 01, select = c("Condition", "ACC_LINK", "RT_LINK", "CONF_LINK", "ACC_SOURCE", "RT_SOURCE", "CONF_SOURCE"))
2. Filter for column 'Condition' (either == 1, 2, 3 or 4), and describe/get the mean and sd of 'RT_LINK', which I so far also did 'manually':
Subj01 %>% filter(Condition == 01) %>% describe(Subj01$RT_LINK)
But here I just get the description of the whole df of Subj01, so I would have to find 4x41 means by hand. It would be great to just have an output with the means and SDs of every participant, but I have no idea where to start and how to tell R to do this.
I tried this, but it won't work:
subsetsubj <- function(x,y) {
Subj_x <- links_cl %>%
subset(links_cl,
Subj == x,
select = c("Condition", "ACC_LINK", "RT_LINK", "CONF_LINK", "ACC_SOURCE", "RT_SOURCE", "CONF_SOURCE")) %>%
filter(Condition == y) %>%
describe(Subj_x$RT_LINK)
}
I also tried putting all dfs into a list and working with that, but it led nowhere.
If there is a solution without the subsetting, that would also work. This just seemed a logical step to me. Any idea, any help how to solve it?
You don't really need to split the dataset up into one dataframe for each patient. I would recommend a standard group_by()/summarize() approach, like this:
links_cl %>%
  group_by(Subj, Condition) %>%
  summarize(mean_val = mean(RT_LINK),
            sd_val = sd(RT_LINK))
Output:
Subj Condition mean_val sd_val
<int> <int> <dbl> <dbl>
1 1 1 0.0375 0.873
2 1 2 0.103 1.05
3 1 3 0.184 0.764
4 1 4 0.0375 0.988
5 2 1 -0.0229 0.962
6 2 2 -0.156 0.820
7 2 3 -0.175 0.999
8 2 4 -0.0763 1.12
9 3 1 0.272 1.02
10 3 2 0.0172 0.835
# … with 158 more rows
Input:
set.seed(123)
links_cl <- data.frame(
  Subj = rep(1:42, each = 100),
  Condition = rep(1:4, times = 4200/4),
  RT_LINK = rnorm(4200)
)
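If a package-free version is preferred, the same per-participant, per-condition summary can be sketched with base R's aggregate(). This reuses the simulated input above; the name res and the flattening step are illustrative choices, not part of the original answer.

```r
set.seed(123)
links_cl <- data.frame(
  Subj = rep(1:42, each = 100),
  Condition = rep(1:4, times = 4200 / 4),
  RT_LINK = rnorm(4200)
)

# One row per Subj/Condition pair; FUN returns both statistics at once
res <- aggregate(
  RT_LINK ~ Subj + Condition,
  data = links_cl,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)

# aggregate() stores the two statistics in a matrix column; flatten it
# into ordinary columns named RT_LINK.mean and RT_LINK.sd
res <- do.call(data.frame, res)

head(res[order(res$Subj, res$Condition), ])
```

Note that the formula interface of aggregate() drops rows with missing values by default, whereas mean(RT_LINK) inside summarize() returns NA for a group containing NAs unless na.rm = TRUE is added.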

Grouping and Running a For Loop in R

I'm looking to include a group statement within my for loop and I'm having difficulty finding any details on how to do this properly.
The example below calculates the Extra, Outstanding and current columns within my loop. I'm trying to group by id so that the loop restarts with every id. My current code:
dat <- tibble(
id = c("A","A","A","A","A","A","B","B"),
rn= c(1,2,3,4,5,6,1,2),
current = c(100,0,0,0,0,0,500,0),
paid = c(10,12,12,13,13,13,20,20),
pct_extra = c(.02,.05,.05,.07, .03, .01, .09,.01),
Extra = NA,
Outstanding = NA)
for(i in 1:nrow(dat)){
dat$Extra[i] <- dat$current[i]*dat$pct_extra[i]
dat$Outstanding[i] <- dat$current[i] - dat$paid[i] - dat$Extra[i]
if(i < nrow(dat)){
dat$current[(i+1)] <- dat$Outstanding[i]}}
I saw other posts with this same question and they seem to revert to using dplyr. So my first attempt was:
for(i in 1:nrow(dat)){
dat%>%
group_by(id)%>%
mutate(Extra=pct_extra*(current-paid),
Outstanding=current-paid-Extra,
current=if_else(rn==1,current,lag(Outstanding)))}
But this attempt didn't actually calculate the Extra, Outstanding and current columns, which my guess is because I'm not using the loop statement properly.
Does anyone have any suggestions/references on how I can include a group statement into my for loop?
Thanks!
A few things.
for loops (surrounding dplyr pipes) are generally not necessary with dplyr grouping, this is no exception (though we will use your for loop in a "single group at a time" way).
Even if it were, you loop with i and never use i, so you're doing the same calculation to all rows, nrow(dat) times.
Third, you aren't storing the results.
My first attempt (after realizing the rolling nature of this) was to try to adapt slider::slide to it, but unfortunately I couldn't get it to work.
In older dplyr, I would use dat %>% group_by(id) %>% do({...}), but do has been superseded in favor of across and multi-row summarize (which I could not figure out how to use for this).
So then I realized that your for loop works fine, it just needs to be applied one group at a time.
func <- function(z) {
  for (i in seq_len(nrow(z))) {
    z$Extra[i] <- z$current[i]*z$pct_extra[i]
    z$Outstanding[i] <- z$current[i] - z$paid[i] - z$Extra[i]
    if (i < nrow(z)) {
      z$current[(i+1)] <- z$Outstanding[i]
    }
  }
  z
}
library(dplyr)
library(tidyr) # nest, unnest
library(purrr) # map, can be done with base::Map as well
dat %>%
  group_by(id) %>%
  nest(quux = -id) %>%
  mutate(quux = map(quux, func)) %>%
  unnest(quux) %>%
  ungroup()
# # A tibble: 8 x 7
# id rn current paid pct_extra Extra Outstanding
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 100 10 0.02 2 88
# 2 A 2 88 12 0.05 4.4 71.6
# 3 A 3 71.6 12 0.05 3.58 56.0
# 4 A 4 56.0 13 0.07 3.92 39.1
# 5 A 5 39.1 13 0.03 1.17 24.9
# 6 A 6 24.9 13 0.01 0.249 11.7
# 7 B 1 500 20 0.09 45 435
# 8 B 2 435 20 0.01 4.35 411.
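The same "one group at a time" idea can also be sketched in base R with split() and lapply(), in case the nest/map machinery is not wanted. The data are the question's, rebuilt as a plain data.frame; roll_one_id is a hypothetical name for a copy of func above.

```r
dat <- data.frame(
  id = c("A", "A", "A", "A", "A", "A", "B", "B"),
  rn = c(1, 2, 3, 4, 5, 6, 1, 2),
  current = c(100, 0, 0, 0, 0, 0, 500, 0),
  paid = c(10, 12, 12, 13, 13, 13, 20, 20),
  pct_extra = c(.02, .05, .05, .07, .03, .01, .09, .01),
  Extra = NA_real_,
  Outstanding = NA_real_
)

# Rolling calculation for the rows of a single id
roll_one_id <- function(z) {
  for (i in seq_len(nrow(z))) {
    z$Extra[i] <- z$current[i] * z$pct_extra[i]
    z$Outstanding[i] <- z$current[i] - z$paid[i] - z$Extra[i]
    # carry the outstanding amount forward to the next row of the same id
    if (i < nrow(z)) {
      z$current[i + 1] <- z$Outstanding[i]
    }
  }
  z
}

# split() makes one data frame per id, so the loop restarts for each group
pieces <- lapply(split(dat, dat$id), roll_one_id)
res <- do.call(rbind, pieces)
rownames(res) <- NULL
res
```

Because each piece contains only one id, the carry-forward can never leak from the last row of A into the first row of B.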

Create dummy variable with survey package

I want to transform a variable into a dummy using the survey package.
I have a complex sample design defined by:
library(survey)
prestratified_design <- svydesign(
id = ~ PSU ,
strata = ~ STRAT,
data = data,
weights = ~ weight ,
nest = TRUE)
The dataset has a variable for education with 8 different categories:
# A tibble: 8 x 3
education n prop
<int> <int> <dbl>
1 1 2919 20.8
2 2 5551 39.5
3 3 447 3.18
4 4 484 3.45
5 5 3719 26.5
6 6 91 0.65
7 9 790 5.63
8 10 39 0.28
I want to create a dummy variable for categories 5 & 10 == 1 and others == 0.
I know that I have to use the update function, but I don't know how to use if in the survey package.
I have tried:
prestratified_design <- update(
prestratified_design,
dummy_educ = as.numeric (education == 5 & education == 10)
but it obviously didn't work.
thank you!
You can create dummy variables in R via ifelse() if the number of categories is two.
df$dummy_educ = with(df, ifelse(education == 5 | education == 10, 1, 0))
If there are more categories, you can use dplyr::case_when(), and if you are creating dummies from a factor variable, model.matrix() is fast and the best choice.
For any new variable to take the complex design into account, you don't need to update your data set (data in your example); instead, you have to update your survey design by adding the new variable. Use update() on the design object; the survey package supplies the method for survey designs.
Following your example, try with the code below:
prestratified_design <- update(prestratified_design,
dummy_educ = as.integer(education == 5 | education == 10))
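To see why the original attempt produced no 1s, here is a small base R illustration on the eight education codes from the question. It only demonstrates the logical operators on a plain vector (no survey design involved), and %in% is shown as an equivalent way to write the OR condition.

```r
education <- c(1, 2, 3, 4, 5, 6, 9, 10)

# A single value can never equal 5 AND 10, so this is FALSE everywhere
both <- education == 5 & education == 10

# Either condition is enough; %in% is a compact way to write the OR
dummy_educ <- as.integer(education %in% c(5, 10))
dummy_educ
```

The same %in% expression can be used as the right-hand side of dummy_educ inside update().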
Good luck with that!

R loop through multiple sub groups with using functions

Hi I am trying to learn how to loop through multiple groups within a data frame and apply certain arithmetic operations. I do not have a programming background and am struggling to loop through the multiple conditions.
My data looks like the following:
Event = c(1,1,1,1,1,2,2,2,2,2)
Indiv1=c(4,5,6,11,45,66,8,9,32,45)
Indiv2=c(7,81,91,67,12,34,56,78,90,12)
Category=c(1,1,2,2,2,1,2,2,1,1)
Play_together=c(1,0,1,1,1,1,1,1,0,1)
Money=c(23,11,78,-9,-12,345,09,43,21,90)
z = data.frame(Event,Indiv1,Indiv2,Category,Play_together,Money)
What I would like to do is to look through each event and each category and take the average value of Money in cases where Play_together == 1. When Play_together==0, then I would like to apply Money/100.
I understand that the loop would look something like the following:
for i in 1:nrow(z){
#loop for event{
#loop for Category{
#Define avg or division function
}
}
}
However, I cannot seem to implement this using a nested loop. I saw another post (link: apply function for each subgroup) which uses dplyr package. I was wondering if someone could help me to implement this without using any packages (I know this might take longer as compared to using R packages). I am trying to learn R and this is the first time I am working with nested loops.
The final output will look like this:
where for event 1, the following holds:
a) For category 1:
Play_together ==1 in row 1; we take the avg of Money value and hence final output = 23/1= 23
Play_together==0 in row 2; we take Money/100= 0.11
b) For category 2:
Play_together == 1 for all observations. We take avg Money for all three observations.
This holds similarly for Event 2. In my actual dataset, I have event = 600 and number of category ranging from 1 - 10. Some events may have only 1 category and a maximum of 10 categories. So any function needs to be extremely flexible. The total number of observations in my dataset is around 1.5 million so any changes in the looping process to reduce the time taken to carry out the operation is going to be extremely helpful (Although at this stage my priority is the looping process itself).
Once again it would be a great help if you can show me how to use nested looping and explain the steps in brief. Much appreciated.
Will something like this do?
I know it's using dplyr, but that package is made for this kind of jobs ;-)
Event = c(1,1,1,1,1,2,2,2,2,2)
Indiv1=c(4,5,6,11,45,66,8,9,32,45)
Indiv2=c(7,81,91,67,12,34,56,78,90,12)
Category=c(1,1,2,2,2,1,2,2,1,1)
Play_together=c(1,0,1,1,1,1,1,1,0,1)
Money=c(23,11,78,-9,-12,345,09,43,21,90)
z = data.frame(Event,Indiv1,Indiv2,Category,Play_together,Money)
library(dplyr)
df_temp <- z %>%
  group_by(Event, Category, Play_together) %>%
  summarise(money_mean = mean(Money)) %>%
  mutate(final_output = ifelse(Play_together == 0, money_mean / 100, money_mean)) %>%
  select(-money_mean)
df <- z %>%
  left_join(df_temp, by = c("Event", "Category", "Play_together")) %>%
  arrange(Event, Category)
Consider base R's by, the object-oriented wrapper to tapply designed to subset dataframes by factor(s) but unlike split can pass subsets into a defined function. Then, run conditional logic with ifelse for Final_Output field. Finally, stack all subsetted dataframes for final object.
# LIST OF DATAFRAMES
by_list <- by(z, z[c("Event", "Category")], function(sub) {
  tmp <- subset(sub, Play_together==1)
  sub$Final_Output <- ifelse(sub$Play_together == 1, mean(tmp$Money), sub$Money/100)
  return(sub)
})
# APPEND ALL DATAFRAMES
final_df <- do.call(rbind, by_list)
row.names(final_df) <- NULL
final_df
# Event Indiv1 Indiv2 Category Play_together Money Final_Output
# 1 1 4 7 1 1 23 23.00
# 2 1 5 81 1 0 11 0.11
# 3 2 66 34 1 1 345 217.50
# 4 2 32 90 1 0 21 0.21
# 5 2 45 12 1 1 90 217.50
# 6 1 6 91 2 1 78 19.00
# 7 1 11 67 2 1 -9 19.00
# 8 1 45 12 2 1 -12 19.00
# 9 2 8 56 2 1 9 26.00
# 10 2 9 78 2 1 43 26.00
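Since the question asks specifically for nested loops without packages, here is a minimal sketch of that approach on the same example data (the Indiv columns are left out for brevity). The loop variable names and the logical masks are illustrative; with 1.5 million rows the by() or dplyr versions are likely to be faster, so this is mainly meant to show the mechanics.

```r
z <- data.frame(
  Event = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Category = c(1, 1, 2, 2, 2, 1, 2, 2, 1, 1),
  Play_together = c(1, 0, 1, 1, 1, 1, 1, 1, 0, 1),
  Money = c(23, 11, 78, -9, -12, 345, 9, 43, 21, 90)
)
z$Final_Output <- NA_real_

# Outer loop: one pass per event
for (ev in unique(z$Event)) {
  # Inner loop: only the categories that actually occur in this event
  for (ct in unique(z$Category[z$Event == ev])) {
    in_group <- z$Event == ev & z$Category == ct
    together <- in_group & z$Play_together == 1
    apart <- in_group & z$Play_together == 0

    # Rows that played together get the group average of Money
    z$Final_Output[together] <- mean(z$Money[together])
    # Rows that did not play together get their own Money / 100
    z$Final_Output[apart] <- z$Money[apart] / 100
  }
}
z
```

Taking the categories from the current event inside the outer loop is what lets the number of categories vary from event to event.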
