Plotting upper quantiles only from a dataframe containing all values - r

I have a large dataframe, df, which contains a list of nonunique identifiers (Cell.ID) and information within each identifier. It looks something like this:
Cell.ID Volume
1 025001G 2.08
2 025001G 0.30
3 025001G 0.99
4 025001G 0.60
5 025001G 0.43
6 025001G 0.24
7 025001G 0.59
8 025001R 1.74
9 025001R 1.09
10 025001R 0.58
11 025001R 0.75
12 025001R 0.62
13 025002G 8.59
14 025002G 1.26
15 025002R 6.31
16 025002R 0.56
17 025003G 1.95
18 025003G 2.18
19 025003G 0.21
What I would like to do is make a plot where the Y axis corresponds to Volume and the X coordinate corresponds to the number of instances of a particular Cell.ID. This part was straightforward, but I would like the Y coordinate for each object to be either a box spanning the upper two quartiles or a point representing the second-highest quartile. Using tapply(df$Volume, df$Cell.ID, quantile) and table(df$Cell.ID) I was able to create a data frame like the one below, which contains the requisite information to make said plot. Freq records how many times a particular Cell.ID (the row name) shows up, and quantile holds the distribution of volumes for objects with that Cell.ID.
row.names quantile Var1 Freq
1 010001G c(0.27, 0.27, 0.325, 0.6125, 1.31) 010001G 4
2 010001R c(0.22, 0.365, 0.51, 0.655, 0.8) 010001R 2
3 010002G c(0.67, 0.8025, 0.935, 1.0675, 1.2) 010002G 2
4 010002R c(0.25, 0.41, 0.57, 0.73, 0.89) 010002R 2
5 010003G c(0.22, 0.295, 0.345, 0.3725, 0.38) 010003G 4
6 010003R c(0.22, 0.2675, 0.315, 0.3625, 0.41) 010003R 2
7 010004G c(0.35, 0.41, 0.625, 1.165, 2.2) 010004G 4
8 010004R c(0.2, 0.4075, 0.615, 0.8225, 1.03) 010004R 2
9 010005G c(3.95, 3.95, 3.95, 3.95, 3.95) 010005G 1
10 010005R c(0.47, 0.775, 1.08, 2.53, 3.98) 010005R 3
11 010006G c(0.25, 0.98, 1.71, 2.98, 4.25) 010006G 3
However, I'm stuck on how to select only certain quantiles from the quantile column in each row for plotting. I've tried a few things but get errors such as this:
Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' is a list, but does not have components 'x' and 'y'

If I understand your question correctly, you don't need all the quantiles, just one or two of them. So you can try something like this:
Q75 <- tapply(df$Volume, df$Cell.ID, quantile, probs = 0.75)
freq <- table(df$Cell.ID)
plot(x = as.vector(freq), y = Q75,
     xlab = "Frequency", ylab = "75th Quantile")
Or for the 75th and 95th quantiles:
Q7595 <- do.call(rbind.data.frame,
                 tapply(df$Volume, df$Cell.ID, quantile,
                        probs = c(0.75, 0.95), simplify = TRUE))
## Empty plot
matplot(x = as.vector(freq), y = Q7595, type = "n",
        xlab = "Frequency", ylab = "75th and 95th Quantiles")
## Boxes spanning the 75th to 95th percentile for each Cell.ID
rect(xleft = as.vector(freq) - 0.25, xright = as.vector(freq) + 0.25,
     ybottom = Q7595[, 1], ytop = Q7595[, 2])
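For reference, here is a self-contained sketch that can be pasted into a fresh session (my reconstruction of the sample data from the question's first table, followed by the code above):
df <- data.frame(
  Cell.ID = c(rep("025001G", 7), rep("025001R", 5),
              rep("025002G", 2), rep("025002R", 2), rep("025003G", 3)),
  Volume  = c(2.08, 0.30, 0.99, 0.60, 0.43, 0.24, 0.59,
              1.74, 1.09, 0.58, 0.75, 0.62,
              8.59, 1.26, 6.31, 0.56, 1.95, 2.18, 0.21)
)
freq  <- table(df$Cell.ID)                       # instances per Cell.ID
Q7595 <- do.call(rbind.data.frame,               # one row of quantiles per Cell.ID
                 tapply(df$Volume, df$Cell.ID, quantile,
                        probs = c(0.75, 0.95)))
matplot(x = as.vector(freq), y = Q7595, type = "n",
        xlab = "Frequency", ylab = "75th and 95th Quantiles")
rect(xleft = as.vector(freq) - 0.25, xright = as.vector(freq) + 0.25,
     ybottom = Q7595[, 1], ytop = Q7595[, 2])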
The result looks like this (plot not shown here). Of course it needs some aesthetic changes, but I hope it helps,
alex


I want to plot multiple lines together in a single plot using geom_abline, but each line has unique x limits

So I have a bunch of studies and I want to plot the regression lines obtained from all of them.
I have a set of slope and intercept values, and x limits for each line, but I need to set the x limits for each line separately as they're all different...
I've racked my brain for days but can't figure out how to achieve this using ggplot. Is there a way to pass a list of xlim values to go along with the slope and intercept values?
Here's a subset of my dataframe...
empirical_studies <- tibble(slope = c(-1.52, -1.42, -1.56, -1.57, -1.57, -1.68, -1.67, -1.6, -1.73, -1.69, -1.79),
                            intercept = c(12.07, 11.29, 12.21, 12.26, 12.14, 12.5, 12.58, 12.28, 12.72, 12.53, 12.29),
                            xmin = c(9.00, 6.00, 6.00, 6.00, 6.00, 9.00, 7.00, 7.00, 7.00, 7.00, 18.00),
                            xmax = c(26.00, 59.00, 53.00, 53.00, 53.00, 63.00, 67.00, 64.00, 64.00, 64.00, 50.00))
Would really appreciate any help!
I didn't really try much to be honest, I know I can't pass in a set of xlim values... I tried writing a loop (below) but this doesn't seem to work either:
plot = ggplot(empirical_studies, aes(x=empirical_studies$diameter_stem_average, y=empirical_studies$density))
for (i in 1:(length(empirical_studies)-1)){
  plot = plot + geom_abline(slope = empirical_studies$slope[i], intercept = empirical_studies$intercept[i]) +
    xlim(empirical_studies$min_dq[i], empirical_studies$max_dq[i])
  i <- i + 1
}
print(plot)
If you want to plot separate line segments you can use geom_line() with a start and end point.
Since you already have the x values of the min and max, you can simply use the slope and intercept to calculate the y value for those points and then plot them. For these manipulations you need to uniquely identify each curve, so I just assigned a number to each, but you could use more informative naming depending on your data.
I show two options below, depending on whether you want to overplot everything or make a separate panel for each. For the faceted option, you can specify scales = "free" to let each facet rescale its axes individually; otherwise it picks the widest range that captures all the data.
library(tidyverse)
empirical_studies <- tibble(slope = c(-1.52, -1.42, -1.56, -1.57, -1.57, -1.68, -1.67, -1.6, -1.73, -1.69, -1.79),
                            intercept = c(12.07, 11.29, 12.21, 12.26, 12.14, 12.5, 12.58, 12.28, 12.72, 12.53, 12.29),
                            xmin = c(9.00, 6.00, 6.00, 6.00, 6.00, 9.00, 7.00, 7.00, 7.00, 7.00, 18.00),
                            xmax = c(26.00, 59.00, 53.00, 53.00, 53.00, 63.00, 67.00, 64.00, 64.00, 64.00, 50.00))
# generate list of start and end points for each line
d <- empirical_studies %>%
  mutate(study_id = row_number()) %>%
  mutate(across(starts_with("x"), ~ slope * . + intercept, .names = "y{str_sub(.col, 2, 4)}")) %>%
  pivot_longer(starts_with(c("x", "y"))) %>%
  separate(name, into = c("var", "position"), sep = 1) %>%
  pivot_wider(names_from = var)
d
#> # A tibble: 22 × 6
#> slope intercept study_id position x y
#> <dbl> <dbl> <int> <chr> <dbl> <dbl>
#> 1 -1.52 12.1 1 min 9 -1.61
#> 2 -1.52 12.1 1 max 26 -27.4
#> 3 -1.42 11.3 2 min 6 2.77
#> 4 -1.42 11.3 2 max 59 -72.5
#> 5 -1.56 12.2 3 min 6 2.85
#> 6 -1.56 12.2 3 max 53 -70.5
#> 7 -1.57 12.3 4 min 6 2.84
#> 8 -1.57 12.3 4 max 53 -71.0
#> 9 -1.57 12.1 5 min 6 2.72
#> 10 -1.57 12.1 5 max 53 -71.1
#> # … with 12 more rows
# plotted together
d %>%
  ggplot(aes(x, y)) +
  geom_line(aes(color = factor(study_id)))
# plotted separately
d %>%
  ggplot(aes(x, y)) +
  geom_line() +
  facet_wrap(~study_id) # can add scales = "free" to scale axes separately
Created on 2022-11-09 with reprex v2.0.2
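As an aside, a shorter variant (my sketch, not from the original answer): since each line is fully determined by its two endpoints, you can compute them in place and let geom_segment() draw the lines without the pivoting steps.
library(tidyverse)
empirical_studies %>%
  mutate(ymin = slope * xmin + intercept,   # y at the left end of each line
         ymax = slope * xmax + intercept,   # y at the right end
         study_id = factor(row_number())) %>%
  ggplot() +
  geom_segment(aes(x = xmin, y = ymin, xend = xmax, yend = ymax,
                   color = study_id))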

Remove leading zeros in numbers *within a data frame*

Edit: For anyone arriving later: THIS IS NOT A DUPLICATE, since it explicitly concerns working on data frames, not single variables/vectors.
I have found several sites describing how to drop leading zeros in numbers or strings, including vectors, but none of the descriptions I found seem applicable to data frames.
The f_num function in the numform package treats "[a] vector of numbers (or string equivalents)", but does not seem to solve unwanted leading zeros in a data frame.
I am relatively new to R, but I understand that I could write some (to my mind) complex code to drop the leading zeros by subsetting vectors from the data frame and then combining those vectors back into a full data frame. I would like to avoid that.
Here is a simple data frame:
df <- structure(list(est = c(0.05, -0.16, -0.02, 0, -0.11, 0.15, -0.26, -0.23),
                     low2.5 = c(0.01, -0.2, -0.05, -0.03, -0.2, 0.1, -0.3, -0.28),
                     up2.5 = c(0.09, -0.12, 0, 0.04, -0.01, 0.2, -0.22, -0.17)),
                row.names = c(NA, 8L), class = "data.frame")
Which gives
df
est low2.5 up2.5
1 0.05 0.01 0.09
2 -0.16 -0.20 -0.12
3 -0.02 -0.05 0.00
4 0.00 -0.03 0.04
5 -0.11 -0.20 -0.01
6 0.15 0.10 0.20
7 -0.26 -0.30 -0.22
8 -0.23 -0.28 -0.17
I would want
est low2.5 up2.5
1 .05 .01 .09
2 -.16 -.20 -.12
3 -.02 -.05 .00
4 .00 -.03 .04
5 -.11 -.20 -.01
6 .15 .10 .20
7 -.26 -.30 -.22
8 -.23 -.28 -.17
Is that possible with relatively simple code for a whole data frame?
Edit: An incorrect link has been removed.
I interpret the intention of your question as wanting to convert each numeric cell in the data.frame into a "pretty-printed" string. That is possible using string substitution and a simple regular expression (a good question BTW, since I do not know of any way to configure the output of numeric data to suppress leading zeros without converting the numeric data into strings!):
df2 <- data.frame(lapply(df,
                         function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", as.character(x)))),
                  stringsAsFactors = FALSE)
df2
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.2 -.12
# 3 -.02 -.05 0
# 4 0 -.03 .04
# 5 -.11 -.2 -.01
# 6 .15 .1 .2
# 7 -.26 -.3 -.22
# 8 -.23 -.28 -.17
str(df2)
# 'data.frame': 8 obs. of 3 variables:
# $ est : chr ".05" "-.16" "-.02" "0" ...
# $ low2.5: chr ".01" "-.2" "-.05" "-.03" ...
# $ up2.5 : chr ".09" "-.12" "0" ".04" ...
If you want a fixed number of digits after the decimal point (as shown in the expected output but not asked for explicitly), you could use sprintf or format:
df3 <- data.frame(lapply(df,
                         function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", sprintf("%.2f", x)))),
                  stringsAsFactors = FALSE)
df3
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.20 -.12
# 3 -.02 -.05 .00
# 4 .00 -.03 .04
# 5 -.11 -.20 -.01
# 6 .15 .10 .20
# 7 -.26 -.30 -.22
# 8 -.23 -.28 -.17
Note: this solution is not robust against a different decimal-point character (other locales); it always expects a decimal point.
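For completeness, a tidyverse sketch of the same idea (my addition, assuming dplyr 1.0+ is available; a single regular expression with a capture group covers both the positive and negative cases, and it shares the decimal-point caveat above):
library(dplyr)
df %>%
  mutate(across(everything(),
                # format to 2 decimals, then drop the zero before the point
                ~ sub("^(-?)0\\.", "\\1.", sprintf("%.2f", .x))))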

R - using dplyr to aggregate on a continuous variable

So I have a data frame of participant data: participant IDs and, for each of those, a set of target values (continuous) and predicted values.
The target value is a continuous variable, but there is a finite number of possible values, and each participant has made a prediction for a subset of these target values.
For example, take this data frame:
data.frame(
  subjectID = c(rep("p001", 4), rep("p002", 4), rep("p003", 4)),
  target = c(0.1, 0.2, 0.3, 0.4, 0.2, 0.3, 0.4, 0.5, 0.1, 0.3, 0.4, 0.5),
  pred = c(0.12, 0.23, 0.31, 0.42, 0.18, 0.32, 0.44, 0.51, 0.09, 0.33, 0.41, 0.55)
)
There are five possible target values: 0.1, 0.2, 0.3, 0.4 and 0.5, but each participant only predicted four of them. I want to get the average prediction pred for each target value target. It's further complicated by each participant belonging to a group, and I only want to average within each group.
I tried using summarise_at but it wasn't liking the continuous data, and whilst I'm pretty experienced in coding in R, it's been a long while since I've done data summary manipulations etc.
I could do this easily in a for loop, but I want to learn to do this properly and I wasn't able to find a solution after googling for a long time.
Thanks very much
H
Just add the second grouping variable in group_by as well:
df <- data.frame(
  subjectID = c(rep("p001", 4), rep("p002", 4), rep("p003", 4)),
  group = c(rep("A", 8), rep("B", 4)),
  target = c(0.1, 0.2, 0.3, 0.4, 0.2, 0.3, 0.4, 0.5, 0.1, 0.3, 0.4, 0.5),
  pred = c(0.12, 0.23, 0.31, 0.42, 0.18, 0.32, 0.44, 0.51, 0.09, 0.33, 0.41, 0.55)
)
df %>%
  group_by(target, group) %>%
  summarise(mean(pred))
Output:
# A tibble: 9 x 3
# Groups: target [?]
target group `mean(pred)`
<dbl> <chr> <dbl>
1 0.100 A 0.120
2 0.100 B 0.0900
3 0.200 A 0.205
4 0.300 A 0.315
5 0.300 B 0.330
6 0.400 A 0.430
7 0.400 B 0.410
8 0.500 A 0.510
9 0.500 B 0.550
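As a small variation (my addition, not part of the original answer, assuming dplyr 1.0 or later): you can name the summary column and drop the grouping explicitly, which avoids the backticked `mean(pred)` column name and the grouped result.
df %>%
  group_by(target, group) %>%
  summarise(mean_pred = mean(pred), .groups = "drop")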

Assign groups based on the trend

I have searched a lot for this, but have not found a solution, even though it looks really simple. I have a dataframe with a column like this:
Value
0.13
0.35
0.62
0.97
0.24
0.59
0.92
0.16
0.29
0.62
0.98
All values lie between 0 and 1. What I want is to assign a new group whenever the value starts to drop, so that within each group the value is increasing. The ideal outcome would look like this:
Value Group
0.13 1
0.35 1
0.62 1
0.97 1
0.24 2
0.59 2
0.92 2
0.16 3
0.29 3
0.62 3
0.98 3
Does anyone have a suggestion for how to address this?
This should do the trick, and uses only vectorised base functions. You may want to exchange the < for <=, if that's the behaviour you want.
vec <- c(0.13, 0.35, 0.62, 0.97, 0.24, 0.59, 0.92, 0.16, 0.29, 0.62, 0.98)
cumsum(c(1, diff(vec) < 0))
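Evaluated on the sample vector, the one-liner reproduces the desired grouping:
cumsum(c(1, diff(vec) < 0))
#> [1] 1 1 1 1 2 2 2 3 3 3 3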
This isn't the most elegant solution, but it works:
value <- c(0.13, 0.35, 0.62, 0.97, 0.24, 0.59, 0.92, 0.16, 0.29, 0.62, 0.98)
foo <- data.frame(value, group = 1)
current_group <- 1
for (i in 2:nrow(foo)) {
  if (foo$value[i] >= foo$value[i-1]) {
    foo$group[i] <- current_group
  } else {
    current_group <- current_group + 1
    foo$group[i] <- current_group
  }
}
Another base R option uses a shifted copy of the column:
df <- data.frame(x = c(0.13, 0.35, 0.62, 0.97, 0.24, 0.59, 0.92, 0.16, 0.29, 0.62, 0.98))
df$y <- c(NA, df$x[-nrow(df)])           # lagged column (previous value)
df$chgdir <- as.numeric(df$x - df$y < 0) # 1 where the value dropped
df$chgdir[is.na(df$chgdir)] <- 0         # first row has no previous value
df$group <- cumsum(df$chgdir) + 1        # determine group number
df[, c("x", "group")]
#>       x group
#> 1  0.13     1
#> 2  0.35     1
#> 3  0.62     1
#> 4  0.97     1
#> 5  0.24     2
#> 6  0.59     2
#> 7  0.92     2
#> 8  0.16     3
#> 9  0.29     3
#> 10 0.62     3
#> 11 0.98     3

Show Kruskal-Wallis test ranks

I performed a Kruskal-Wallis test on multi-treatment data where I compared five different methods.
A friend showed me the calculation in SPSS, and the results included the mean ranks of each method.
In R, I only get the chi-squared value, df and p-value when applying kruskal.test to my data set. Those values are equal to the ones in SPSS, but I do not get any ranks.
How can I print out the ranks of the computation?
My code looks like this:
comparison <- kruskal.test(all,V3,p.adj="bon",group=FALSE, main="over")
If I print comparison I get the following:
Kruskal-Wallis rank sum test
data: all
Kruskal-Wallis chi-squared = 131.4412, df = 4, p-value < 2.2e-16
But I would like to get something like this additional output from SPSS:
Type H Middle Rank
1,00 57 121.11
2,00 57 148.32
3,00 57 217.49
4,00 57 53.75
5,00 57 174.33
total 285
How do I get this done in R?
Unfortunately you have to compute the table you want yourself. Luckily, I have made a function for you:
# example data: airquality ozone readings, grouped by month
ozone <- airquality$Ozone
names(ozone) <- airquality$Month
spssOutput <- function(vector) {
  # This function takes your data as one long
  # vector and ranks it. After that it computes
  # the mean rank of each group. The groups
  # need to be given as names of the vector.
  # The function returns a data frame with
  # the results in SPSS style.
  ma <- matrix(nrow = 0, ncol = 3)
  r <- rank(vector, na.last = NA)
  to <- 0
  for (n in unique(names(r))) {
    # compute the rank mean for group n
    g <- r[names(r) == n]
    gt <- length(g)
    rm <- sum(g) / gt
    to <- to + gt
    ma <- rbind(ma, c(n, gt, rm))
  }
  colnames(ma) <- c("Type", "H", "Middle Rank")
  ma <- rbind(ma, c("total", to, ""))
  as.data.frame(ma)
}
# calculate everything
out <- spssOutput(ozone)
print(out, row.names= FALSE)
kruskal.test(Ozone ~ Month, data = airquality)
This gives you the following output:
Type H Middle Rank
5 26 36.6923076923077
6 9 48.7222222222222
7 26 77.9038461538462
8 26 75.2307692307692
9 29 48.6896551724138
total 116
Kruskal-Wallis rank sum test
data: Ozone by Month
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 6.901e-06
You haven't shared your data so you have to figure out yourself how this would work for your data set.
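For comparison, here is a shorter route to the same mean ranks (my sketch using base R's rank() and aggregate(), not part of the original answer; the values match the table above):
aq <- airquality[!is.na(airquality$Ozone), ]  # drop missing readings
aq$rank <- rank(aq$Ozone)                     # rank the pooled values
aggregate(rank ~ Month, data = aq, FUN = mean)
#>   Month     rank
#> 1     5 36.69231
#> 2     6 48.72222
#> 3     7 77.90385
#> 4     8 75.23077
#> 5     9 48.68966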
I had an assignment where I had to do this. Make a data frame where one column holds the combined values you're ranking, one column holds the category each value belongs to, and the final column holds the rank of each value. The function rank() does the actual ranking. The code looks like this:
low <- c(0.56, 0.57, 0.58, 0.62, 0.64, 0.65, 0.67, 0.68, 0.74, 0.78, 0.85, 0.86)
medium <- c(0.70, 0.74, 0.75, 0.76, 0.78, 0.79, 0.80, 0.82, 0.83, 0.86)
high <- c(0.65, 0.73, 0.74, 0.76, 0.81, 0.82, 0.85, 0.86, 0.88, 0.90)
data.value <- c(low, medium, high)
data.category <- c(rep("low", length(low)), rep("medium", length(medium)), rep("high", length(high)))
data.rank <- rank(data.value)
data <- data.frame(data.value, data.category, data.rank)
data
data.value data.category data.rank
1 0.56 low 1.0
2 0.57 low 2.0
3 0.58 low 3.0
4 0.62 low 4.0
5 0.64 low 5.0
6 0.65 low 6.5
7 0.67 low 8.0
8 0.68 low 9.0
9 0.74 low 13.0
10 0.78 low 18.5
11 0.85 low 26.5
12 0.86 low 29.0
13 0.70 medium 10.0
14 0.74 medium 13.0
15 0.75 medium 15.0
16 0.76 medium 16.5
17 0.78 medium 18.5
18 0.79 medium 20.0
19 0.80 medium 21.0
20 0.82 medium 23.5
21 0.83 medium 25.0
22 0.86 medium 29.0
23 0.65 high 6.5
24 0.73 high 11.0
25 0.74 high 13.0
26 0.76 high 16.5
27 0.81 high 22.0
28 0.82 high 23.5
29 0.85 high 26.5
30 0.86 high 29.0
31 0.88 high 31.0
32 0.90 high 32.0
From this table you can then compute the mean rank per category, which is what the SPSS output reports.
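One way to do that (a follow-on sketch of mine, not from the original answer) is to aggregate the rank column by category:
aggregate(data.rank ~ data.category, data = data, FUN = mean)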
