I performed a Kruskal-Wallis test on multi-treatment data where I compared five different methods.
A friend showed me the calculation in SPSS, and the results included the mean ranks of each method.
In R, I only get the chi-squared value, the df and the p-value when applying kruskal.test to my data set. Those values are equal to the ones in SPSS, but I do not get any ranks.
How can I print out the ranks of the computation?
My code looks like this:
comparison <- kruskal.test(all,V3,p.adj="bon",group=FALSE, main="over")
If I print comparison I get the following:
Kruskal-Wallis rank sum test
data: all
Kruskal-Wallis chi-squared = 131.4412, df = 4, p-value < 2.2e-16
But I would like to get something like this additional output from spss:
Type H Middle Rank
1,00 57 121.11
2,00 57 148.32
3,00 57 217.49
4,00 57 53.75
5,00 57 174.33
total 285
How do I get this done in R?
Unfortunately, you have to compute the table you want yourself. Luckily, I have made a function for you:
# use the built-in airquality data as an example
ozone <- airquality$Ozone
names(ozone) <- airquality$Month
spssOutput <- function(vector) {
# This function takes your data as one long
# vector and ranks it. After that it computes
# the mean rank of each group. The groups
# need to be given as names to the vector.
# The function returns a data frame with
# the results in SPSS style.
ma <- matrix(, ncol=3, nrow= 0)
r <- rank(vector, na.last = NA)
to <- 0
for(n in unique(names(r))){
# compute the rank mean for group n
g <- r[names(r) == n]
gt <- length(g)
rm <- sum(g)/gt
to <- to + gt
ma <- rbind(ma, c(n, gt, rm))
}
colnames(ma) <- c("Type","H","Middle Rank")
ma <- rbind(ma, c("total", to, ""))
as.data.frame(ma)
}
# calculate everything
out <- spssOutput(ozone)
print(out, row.names= FALSE)
kruskal.test(Ozone ~ Month, data = airquality)
This gives you the following output:
Type H Middle Rank
5 26 36.6923076923077
6 9 48.7222222222222
7 26 77.9038461538462
8 26 75.2307692307692
9 29 48.6896551724138
total 116
Kruskal-Wallis rank sum test
data: Ozone by Month
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 6.901e-06
You haven't shared your data, so you will have to work out yourself how to adapt this to your data set.
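A shorter route to the same numbers, as a sketch only: rank the non-missing values once and then summarise the ranks per group with tapply(). The column names N and MeanRank are my own labels, not what SPSS prints.

```r
# Keep only rows with an observed Ozone value
# (same effect as rank(..., na.last = NA) in the function above)
aq <- airquality[!is.na(airquality$Ozone), ]

# Rank all observations together; ties get average ranks (the default)
aq$rank <- rank(aq$Ozone)

# Group size and mean rank per month
n <- tapply(aq$rank, aq$Month, length)
mean_rank <- tapply(aq$rank, aq$Month, mean)

# N and MeanRank are illustrative column names
res <- data.frame(Type = names(mean_rank),
                  N = as.vector(n),
                  MeanRank = as.vector(mean_rank))
print(res, row.names = FALSE)
```

For your own data, replace Ozone and Month with the value column and the grouping column.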
I had an assignment where I had to do this. Make a data frame where one column is the combined values you're ranking, one column is the categories each value belongs to, and the final column is the ranking of each value. The function rank() is the one you need for the actual ranking. The code looks like this:
low <- c(0.56, 0.57, 0.58, 0.62, 0.64, 0.65, 0.67, 0.68, 0.74, 0.78, 0.85, 0.86)
medium <- c(0.70, 0.74, 0.75, 0.76, 0.78, 0.79, 0.80, 0.82, 0.83, 0.86)
high <- c(0.65, 0.73, 0.74, 0.76, 0.81,0.82, 0.85, 0.86, 0.88, 0.90)
data.value <- c(low, medium, high)
data.category <- c(rep("low", length(low)), rep("medium", length(medium)), rep("high", length(high)) )
data.rank <- rank(data.value)
data <- data.frame(data.value, data.category, data.rank)
data
data.value data.category data.rank
1 0.56 low 1.0
2 0.57 low 2.0
3 0.58 low 3.0
4 0.62 low 4.0
5 0.64 low 5.0
6 0.65 low 6.5
7 0.67 low 8.0
8 0.68 low 9.0
9 0.74 low 13.0
10 0.78 low 18.5
11 0.85 low 26.5
12 0.86 low 29.0
13 0.70 medium 10.0
14 0.74 medium 13.0
15 0.75 medium 15.0
16 0.76 medium 16.5
17 0.78 medium 18.5
18 0.79 medium 20.0
19 0.80 medium 21.0
20 0.82 medium 23.5
21 0.83 medium 25.0
22 0.86 medium 29.0
23 0.65 high 6.5
24 0.73 high 11.0
25 0.74 high 13.0
26 0.76 high 16.5
27 0.81 high 22.0
28 0.82 high 23.5
29 0.85 high 26.5
30 0.86 high 29.0
31 0.88 high 31.0
32 0.90 high 32.0
This gives you the table shown above. Averaging data.rank within each data.category then gives the mean rank per group.
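To go from the ranked table to the mean rank per category, one option (a sketch, reusing the same example vectors; mean.rank is a name I made up) is tapply():

```r
low <- c(0.56, 0.57, 0.58, 0.62, 0.64, 0.65, 0.67, 0.68, 0.74, 0.78, 0.85, 0.86)
medium <- c(0.70, 0.74, 0.75, 0.76, 0.78, 0.79, 0.80, 0.82, 0.83, 0.86)
high <- c(0.65, 0.73, 0.74, 0.76, 0.81, 0.82, 0.85, 0.86, 0.88, 0.90)

data.value <- c(low, medium, high)
data.category <- c(rep("low", length(low)),
                   rep("medium", length(medium)),
                   rep("high", length(high)))

# Rank all values together, then average the ranks within each category
data.rank <- rank(data.value)
mean.rank <- tapply(data.rank, data.category, mean)
mean.rank
```

The groups come back in alphabetical order (high, low, medium), not in the order they were created.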
I have a dataframe as follows:
# A tibble: 6 x 4
Placebo High Medium Low
<dbl> <dbl> <dbl> <dbl>
1 0.0400 -0.04 0.0100 0.0100
2 0.04 0 -0.0100 0.04
3 0.0200 -0.1 -0.05 -0.0200
4 0.03 -0.0200 0.03 -0.00700
5 -0.00500 -0.0100 0.0200 0.0100
6 0.0300 -0.0100 NA NA
You could get Cohen's d for two of the columns using the cohen.d() function from the effsize package:
df <- data.frame(Placebo = c(0.0400, 0.04, 0.0200, 0.03, -0.00500, 0.0300),
Low = c(-0.04, 0, -0.1, -0.0200, -0.0100, -0.0100),
Medium = c(0.0100, -0.0100, -0.05, 0.03, 0.0200, NA ),
High = c(0.0100, 0.04, -0.0200, -0.00700, 0.0100, NA))
library(effsize)
cohen.d(as.vector(na.omit(df$Placebo)), as.vector(na.omit(df$High)))
Interestingly enough, I'm getting the following error with this code:
Error in data[, group] : incorrect number of dimensions
However, I would like to create a function that computes Cohen's d between one of the columns and each of the others.
To get Cohen's d for all columns against Placebo, we would use something like:
sapply(df, function(i) cohen.d(pull(df, as.vector(na.omit(!!Placebo))), as.vector(na.omit(i))))
But I'm not sure this would work anyway.
Edit: I don't want to erase the full row, as Cohen's d can be computed for vectors of different lengths. Ideally, I would like to get the statistic with the NAs removed for each column independently.
It may be better to remove the NA on each of the columns separately by creating a logical index along with 'Placebo'
library(dplyr)
library(effsize)
df %>%
summarise(across(Low:High, ~ list({
i1 <- complete.cases(Placebo)& complete.cases(.x)
cohen.d(Placebo[i1], .x[i1])})))
Or if we want to use lapply/sapply, loop over the columns other than Placebo
lapply(df[-1], function(x) {
x1 <- na.omit(cbind(df$Placebo, x))
cohen.d(x1[,1], x1[,2])
})
Output:
$Low
Cohen's d
d estimate: 1.947312 (large)
95 percent confidence interval:
lower upper
0.3854929 3.5091319
$Medium
Cohen's d
d estimate: 0.9622504 (large)
95 percent confidence interval:
lower upper
-0.5782851 2.5027860
$High
Cohen's d
d estimate: 0.8884639 (large)
95 percent confidence interval:
lower upper
-0.6402419 2.4171697
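If the goal is really to drop NAs from each column independently, as the edit to the question asks, a base R sketch might look like this. cohens_d is a hypothetical helper of mine, not part of effsize; it assumes the pooled-standard-deviation form of Cohen's d.

```r
df <- data.frame(Placebo = c(0.0400, 0.04, 0.0200, 0.03, -0.00500, 0.0300),
                 Low = c(-0.04, 0, -0.1, -0.0200, -0.0100, -0.0100),
                 Medium = c(0.0100, -0.0100, -0.05, 0.03, 0.0200, NA),
                 High = c(0.0100, 0.04, -0.0200, -0.00700, 0.0100, NA))

# Hypothetical helper: Cohen's d with a pooled standard deviation.
# NAs are removed from x and y separately, so the lengths may differ.
cohens_d <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Placebo against every other column
d <- sapply(df[-1], cohens_d, x = df$Placebo)
d
```

For Low this should agree with the estimate printed above. For Medium and High it may differ, because the code above also drops the Placebo value in the row where the other column is NA, while this sketch keeps all six Placebo values.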
I have the following data
PERIOD GROWTH PRICE
2011K1 0.88 0.88
2011K2 0.93 0.93
2011K3 0.96 0.96
2011K4 0.98 0.98
2012K1 1.13
2012K2 1.16
2012K3 1.12
2012K4 1.17
2013K1 1.07
2013K2 1.11
2013K3 1.03
2013K4 1.03
In 2011 PRICE = GROWTH
In 2012K1 PRICE = GROWTH[2012K1]*avg(PRICE in 2011)
In 2012K2 PRICE = GROWTH[2012K2]*avg(PRICE in 2011)
In 2012K3 PRICE = GROWTH[2012K3]*avg(PRICE in 2011)
In 2012K4 PRICE = GROWTH[2012K4]*avg(PRICE in 2011)
In 2013K1 PRICE = GROWTH[2013K1]*avg(PRICE in 2012)
In 2013K2 PRICE = GROWTH[2013K2]*avg(PRICE in 2012)
In 2013K3 PRICE = GROWTH[2013K3]*avg(PRICE in 2012)
In 2013K4 PRICE = GROWTH[2013K4]*avg(PRICE in 2012)
...
In each quarter, the average price from the previous year is used to multiply GROWTH in that particular quarter, i.e. every quarter within the same year is multiplied by the same average price, namely the average price of the year before.
I tried using cumprod() but failed to make it roll annually when my data is quarterly. I could write a for-loop, but the problem is that I have to do this for thousands of products.
Any suggestions?
Given the data in the Note at the end, compute the quarter number, qtr, and then loop through the rows computing PRICE. No packages are used.
run <- function(DF, k = 4) {
nr <- nrow(DF)
DF$qtr <- 1:k
for(i in (k+1):nr) DF$PRICE[i] <- DF$GROWTH[i] * mean(DF$PRICE[i-DF$qtr[i]-(k-1):0])
DF
}
run(DF.orig)
giving:
PERIOD GROWTH PRICE qtr
1 2011K1 0.88 0.880000 1
2 2011K2 0.93 0.930000 2
3 2011K3 0.96 0.960000 3
4 2011K4 0.98 0.980000 4
5 2012K1 1.13 1.059375 1
6 2012K2 1.16 1.087500 2
7 2012K3 1.12 1.050000 3
8 2012K4 1.17 1.096875 4
9 2013K1 1.07 1.148578 1
10 2013K2 1.11 1.191516 2
11 2013K3 1.03 1.105641 3
12 2013K4 1.03 1.105641 4
Note
Lines <- "PERIOD GROWTH PRICE
2011K1 0.88 0.88
2011K2 0.93 0.93
2011K3 0.96 0.96
2011K4 0.98 0.98
2012K1 1.13
2012K2 1.16
2012K3 1.12
2012K4 1.17
2013K1 1.07
2013K2 1.11
2013K3 1.03
2013K4 1.03"
DF.orig <- read.table(text = Lines, header = TRUE, fill = TRUE, as.is = TRUE)
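Since the question mentions cumprod() and thousands of products, here is a loop-free sketch. It relies on the observation that the average PRICE of a year equals that year's average GROWTH times the previous year's average PRICE, so the yearly average prices are a cumulative product of the yearly average growth. It assumes the year is the first four characters of PERIOD and that the first year is the base year with PRICE = GROWTH; all variable names are mine.

```r
DF <- data.frame(
  PERIOD = paste0(rep(2011:2013, each = 4), "K", 1:4),
  GROWTH = c(0.88, 0.93, 0.96, 0.98, 1.13, 1.16, 1.12, 1.17,
             1.07, 1.11, 1.03, 1.03),
  stringsAsFactors = FALSE
)

# Assumption: the year is the first four characters of PERIOD
year <- substr(DF$PERIOD, 1, 4)

# Average GROWTH per year, as a named vector ordered by year
annual_growth <- vapply(split(DF$GROWTH, year), mean, numeric(1))

# Average PRICE per year is the running product of the annual average growth
avg_price <- cumprod(annual_growth)

# Multiplier for each year: previous year's average price (1 for the base year)
prev_avg <- c(1, head(unname(avg_price), -1))
names(prev_avg) <- names(annual_growth)

DF$PRICE <- DF$GROWTH * unname(prev_avg[year])
DF
```

For many products, the same steps could be applied per product, for example by splitting the data on a product column first.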
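-- Update: Realized this answer produces an incorrect result: lag() shifts by one row (one quarter) rather than one year, and the code averages growth rather than price -- #Rebecca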
Another option :)
# I'll use tidyverse for this approach.
library(tidyverse)
# First, I'll generate a dataset similar to yours.
data <- tibble(year = rep(2011:2013, each=4),
quarter = rep(1:4, times=3),
growth_quarter = c(0.88,
0.93,
0.96,
0.98,
1.13,
1.16,
1.12,
1.17,
1.07,
1.11,
1.03,
1.03))
# Create a new tibble with desired output.
data_m <- data %>%
# Find the average growth per year.
group_by(year) %>%
mutate(growth_annual = mean(growth_quarter)) %>%
# Remove grouping by year for next calculations.
ungroup() %>%
# Organize by year and quarter to ensure consistent results for calculation in next step.
arrange(year, quarter) %>%
# Multiply current quarter's growth by last year's average growth.
mutate(growth_quarter*lag(growth_annual))
Please let me know if you have any questions!
I have a data frame called "grass". One of the columns in this data frame is "line", which can be: high, low, f1, f2, bl or bh.
I created a new column and want to add information to this column as the following code shows.
The problem is that I get "1" for all, not just for "high"
#add new column
grass["genome.inherited"] <- NA
#adding information to genome.inherited
#1 for the high-tolerance parent genotype (high)
#0 for the low-tolerance parent genotype (low)
#0.5 for the F1 and F2 hybrids (f1) (f2)
#0.25 for the backcross to the low tolerance population (bl)
#0.75 for the backcross to the high tolerance population (bh)
#how I tried to solve the problem
grass$genome.inherited <- if(grass$line == 'high'){
1
} else if(grass$line == 'low'){
0
} else if(grass$line == 'bl'){
0.25
} else if(grass$line == 'bh'){
0.75
} else {
0.5
}
As suggested here is the output for head(grass)
line cube.root.height genome.inherited
high 4.13 1
high 5.36 1
high 4.37 1
high 5.08 1
high 4.85 1
high 5.59 1
Thank you!
How about using the match function? It returns the position of each value in a lookup vector and has a "nomatch" argument for values that are not found.
grass$genome.inherited <- c(1, 0, 0.25, 0.75, 0.5)[
match( grass$line, c( 'high', 'low','bl','bh'), nomatch=5) ]
Example from console with other values of line to test:
grass <- read.table(text="line cube.root.height genome.inherited
high 4.13 1
high 5.36 2
low 4.37 1
high 5.08 1
junk 4.85 1
high 5.59 1
", head=T)
grass$genome.inherited <- c(1, 0, 0.25, 0.75, 0.5)[
match( grass$line, c( 'high', 'low','bl','bh'), nomatch=5) ]
grass
#----
line cube.root.height genome.inherited
1 high 4.13 1.0
2 high 5.36 1.0
3 low 4.37 0.0
4 high 5.08 1.0
5 junk 4.85 0.5
6 high 5.59 1.0
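To make the indexing step more visible, here is a small sketch that prints the intermediate positions (the vector names codes, lines, pos and values are mine):

```r
codes <- c("high", "low", "bl", "bh")
lines <- c("high", "f1", "bl", "junk", "low")

# Position of each label in codes; anything not found gets position 5
pos <- match(lines, codes, nomatch = 5)
pos

# Position 5 points at the default value 0.5
values <- c(1, 0, 0.25, 0.75, 0.5)[pos]
values
```

Note that f1 and f2 get 0.5 only because they fall through to nomatch, so a typo such as "junk" silently gets 0.5 as well.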
You don't have to create a new column with NA first. Here is code that does it for you:
grass$genome_inherited_values <- ifelse(grass$line == 'high', 1,
ifelse(grass$line == 'low', 0,
ifelse(grass$line == 'bl',0.25,
ifelse(grass$line == 'bh',0.75,0.5))))
I agree (with 42-) that nested ifelse statements are not preferred. @42-'s solution using match is (imo) far better than nested ifelses.
An alternative is to merge them.
Data:
grass <- read.table(text="line cube.root.height
high 4.13
high 5.36
low 4.37
high 5.08
junk 4.85
high 5.59
", head=TRUE, stringsAsFactors=FALSE)
The table of values to merge in:
genome <- data.frame(
line=c("high","low","bl","bh"),
genome.inherited=c(1, 0, 0.25, 0.75),
stringsAsFactors=FALSE)
The merge:
grass2 <- merge(grass, genome, by="line", all.x=TRUE)
If you look at the data, you'll see an NA, because "junk" (an unknown value) is not present in the genome table and therefore assigned as NA. We can fix this with an easy step:
grass2$genome.inherited[is.na(grass2$genome.inherited)] <- 0.5
grass2
# line cube.root.height genome.inherited
# 1 high 4.13 1.0
# 2 high 5.36 1.0
# 3 high 5.08 1.0
# 4 high 5.59 1.0
# 5 junk 4.85 0.5
# 6 low 4.37 0.0
@42-'s answer has the advantage of providing a default (nomatch) value in the initial call.
Your if conditions have length > 1. if() is not vectorized: when the condition has length > 1, only the first element is used (with a warning; since R 4.2.0 this is an error), and that's why you are getting all 1s.
Here's a different approach (simpler than nested ifelse) for the same thing:
vals <- c(high = 1, low = 0, f1 = 0.5, f2 = 0.5, bl = 0.25, bh = 0.75)
grass$genome.inherited <- vals[as.character(grass$line)]
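One thing to be aware of with the named-vector lookup, shown as a small sketch with made-up input: a label that is not in vals comes back as NA rather than a default, which makes it easy to spot unexpected values.

```r
vals <- c(high = 1, low = 0, f1 = 0.5, f2 = 0.5, bl = 0.25, bh = 0.75)

# Example labels, including one that is not in vals
line <- c("high", "low", "f2", "bh", "junk")

# Look up by name; unname() drops the labels from the result
inherited <- unname(vals[line])
inherited

# Labels that had no entry in vals
unknown <- line[is.na(inherited)]
unknown
```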
I have an R script that I am using to create a new data matrix consisting of bin values from a matrix of continuous data. Right now it works fine using this command:
quant.mat <- apply(input.dat,2,quantile)
But this gives me the standard quantiles (0, 0.25, 0.5, 0.75, and 1). What I want is to be able to specify arbitrary probabilities instead (e.g. 0.2, 0.4, 0.6, 0.8). I can't seem to make it work.
You can pass arbitrary probability values to the probs parameter, for example, if you have a random data frame:
input.dat
A B C D
1 78 12 43 12
2 23 12 42 13
3 14 42 11 99
4 49 94 27 72
apply(input.dat, 2, quantile, probs = c(0.2, 0.4, 0.6, 0.8))
A B C D
20% 19.4 12.0 20.6 12.6
40% 28.2 18.0 30.0 24.8
60% 43.8 36.0 39.0 60.2
80% 60.6 62.8 42.4 82.8
You can pass the cut points that you like to the quantile function. For example:
values<-runif(100,0,1)
quantile(values, c(0.2,0.4,0.6,0.8))
Does it answer your question?
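If by "bin values" you mean the bin each observation falls into, rather than the cut points themselves, one possible follow-up is findInterval() on the per-column quantiles. This is only a sketch of my reading of the question; the data is the example frame from the answer above.

```r
input.dat <- data.frame(A = c(78, 23, 14, 49),
                        B = c(12, 12, 42, 94),
                        C = c(43, 42, 11, 27),
                        D = c(12, 13, 99, 72))
probs <- c(0.2, 0.4, 0.6, 0.8)

# For each column: compute its quantiles, then find the bin of every value.
# Bin 0 is below the 20% quantile, bin 4 is at or above the 80% quantile.
bins <- apply(input.dat, 2, function(x) {
  findInterval(x, quantile(x, probs = probs))
})
bins
```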
I'm working with biochemical data from subjects, analysing the results by sex. I have 19 biochemical tests to analyse for each sex, for each of two drugs (haematology and anatomy tests coming later).
For reasons of reproducibility of results and for preventing transcription errors, I am trying to summarise each test into one table. Included in the table output, I need a column for the Dunnett post hoc comparison p-values. Because the Dunnett test compares to the control results, with a control and 3 drug levels I only get 3 p-values. However, I have 4 mean and sd values.
Using ddply to get the mean and sd results (having limited the number of significant figures), I get a dataset that looks like this:
Sex<- c(rep("F",4), rep("M",4))
Druglevel <- c(rep(0:3,2))
Sample <- c(rep(10,8))
Mean <- c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57)
sd <- c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
Drug1Biochem1 <- data.frame(Sex, Druglevel, Sample, Mean, sd)
I have used glht in the package multcomp to perform the Dunnett tests on the aov object I constructed from undertaking a normal aov. I've extracted the p-values from the glht summary (I've rounded these to three decimal places). The male and female analyses have been run using separate ANOVA so I have one set of output for each sex. The female results are:
femaleR <- c(0.371, 0.973, 0.490)
and the male results are:
maleR <- c(0.862, 0.999, 0.738)
How can I append a column for the p-values to my original dataframe (Drug1Biochem1) so that both femaleR and maleR are in that final column, with row 1 and row 5 of that column empty (i.e. no p-values for the control)?
I wish to output the resulting combination to html, which can be inserted into a Word document so no transcription errors occur. I have set a seed value so that the results of the program are reproducible (when I finally stop debugging).
In summary, I would like a data frame (or table, or whatever I can output to html) that has the following format:
Sex Druglevel Sample Mean sd p-value
F 0 10 0.44 0.07
F 1 10 0.50 0.07 0.371
F 2 10 0.46 0.09 0.973
F 3 10 0.49 0.12 0.490
M 0 10 0.48 0.18
M 1 10 0.55 0.19 0.862
M 2 10 0.47 0.13 0.999
M 3 10 0.57 0.41 0.738
For each test, I wish to reproduce this exact table. There will always be 4 groups per sex, and there will never be a p-value for the control, which will always be summarised in row 1 (F) and row 5 (M).
You could try merge
dN <- data.frame(Sex=rep(c('M', 'F'), each=3), Druglevel=1:3,
pval=c(maleR, femaleR))
merge(Drug1Biochem1, dN, by=c('Sex', 'Druglevel'), all=TRUE)
# Sex Druglevel Sample Mean sd pval
#1 F 0 10 0.44 0.07 NA
#2 F 1 10 0.50 0.07 0.371
#3 F 2 10 0.46 0.09 0.973
#4 F 3 10 0.49 0.12 0.490
#5 M 0 10 0.48 0.18 NA
#6 M 1 10 0.55 0.19 0.862
#7 M 2 10 0.47 0.13 0.999
#8 M 3 10 0.57 0.41 0.738
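Because the layout is always the same (four rows per sex, control first), a direct assignment is another option. This sketch assumes the rows are ordered F then M, as in the question, so that c(femaleR, maleR) lines up with the non-control rows; the column name pval is borrowed from the merge example.

```r
Sex <- c(rep("F", 4), rep("M", 4))
Druglevel <- c(rep(0:3, 2))
Sample <- c(rep(10, 8))
Mean <- c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57)
sd <- c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
Drug1Biochem1 <- data.frame(Sex, Druglevel, Sample, Mean, sd)

femaleR <- c(0.371, 0.973, 0.490)
maleR <- c(0.862, 0.999, 0.738)

# Start with NA everywhere, then fill only the non-control rows.
# Assumes rows are ordered F (levels 0-3) then M (levels 0-3).
Drug1Biochem1$pval <- NA_real_
Drug1Biochem1$pval[Drug1Biochem1$Druglevel != 0] <- c(femaleR, maleR)
Drug1Biochem1
```

The merge approach is safer if the row order might ever change, since it matches on Sex and Druglevel explicitly.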