I am trying to label the min, median, and max values on a boxplot that I created. However, the boxplot is built from two different data frames, so I am not sure how to label the values.
Dummy data:
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
class1<- data.frame(Name, Age)
boxplot(class1$Age)
Name1 <- c("Suma", "Mia", "Sam", "Jon", "Brian", "Grace", "Julia")
Age1<- c(33, 21, 56,32,65,32,89)
class2 <-data.frame(Name1, Age1)
boxplot(class1$Age, class2$Age1, names = c("class1", "class2"),ylab= "age", xlab= "class")
I would like to show the data values on the boxplot (shown in image), each with what it indicates (e.g. min, median, max).
Many thanks
You can get the five-number summary of each group with fivenum and draw it with the function text, passing the numbers both as the y positions and as the labels argument and choosing an x position beside each box, like this:
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
class1<- data.frame(Name, Age)
Name1 <- c("Suma", "Mia", "Sam", "Jon", "Brian", "Grace", "Julia")
Age1<- c(33, 21, 56,32,65,32,89)
class2 <-data.frame(Name1, Age1)
boxplot(class1$Age, class2$Age1, names = c("class1", "class2"),ylab= "age", xlab= "class")
text(y = fivenum(class1$Age), labels = fivenum(class1$Age), x=0.5)
text(y = fivenum(class2$Age1), labels = fivenum(class2$Age1), x=2.5)
Created on 2023-01-01 with reprex v2.0.2
If you only want the min (1), median (3), and max (5), extract the first, third, and fifth values returned by fivenum like this:
boxplot(class1$Age, class2$Age1, names = c("class1", "class2"),ylab= "age", xlab= "class")
text(y = fivenum(class1$Age)[c(1,3,5)], labels = fivenum(class1$Age)[c(1,3,5)], x=0.5)
text(y = fivenum(class2$Age1)[c(1,3,5)], labels = fivenum(class2$Age1)[c(1,3,5)], x=2.5)
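As a sketch building on the snippets above, the labels can also say what they are (min, median, max) and be drawn in a loop so that every group is handled the same way. The helper name box_labels, the x offsets, and the widened xlim are illustrative assumptions, not part of the original answer.

```r
Age  <- c(23, 41, 32, 58, 26)
Age1 <- c(33, 21, 56, 32, 65, 32, 89)
groups <- list(class1 = Age, class2 = Age1)

# Hypothetical helper: min, median and max of one group with readable labels
box_labels <- function(x) {
  s <- fivenum(x)[c(1, 3, 5)]  # elements 1, 3, 5 are min, median, max
  data.frame(value = s, label = paste0(c("min", "median", "max"), ": ", s))
}

# Extra room on both sides so the labels are not clipped
boxplot(groups, ylab = "age", xlab = "class", xlim = c(-0.2, 3.2))

x_pos <- c(0.55, 2.45)  # just outside the left and right box edges
side  <- c(2, 4)        # pos = 2 writes to the left of x, pos = 4 to the right
for (i in seq_along(groups)) {
  lab <- box_labels(groups[[i]])
  text(x = x_pos[i], y = lab$value, labels = lab$label, pos = side[i])
}
```

With more than two groups, placing every label on the same side of its box (and spacing the boxes with at) would probably be easier to read.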
The following code adds a new column, Class, containing the class name to both data frames, then binds them together with rbind.
The boxplot is then created, with the at argument leaving a bit more space between the boxes.
tapply applies fivenum to each Class, and those numbers go into a new data frame that holds the values needed for the text annotations.
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina")
Age <- c(23, 41, 32, 58, 26)
Class <- rep("Class1", 5)
class1 <- data.frame(Name, Age, Class)
Name1 <- c("Suma", "Mia", "Sam", "Jon", "Brian", "Grace", "Julia")
Age1 <- c(33, 21, 56, 32, 65, 32, 89)
Class1 <- rep("Class2", 7)
class2 <- data.frame(Name = Name1, Age = Age1, Class = Class1)
df <- rbind(class1, class2)
bp <- boxplot(df$Age ~ factor(df$Class),
  names = c("Class1", "Class2"),
  ylim = c(0, 100),
  xlim = c(0, 5),
  xlab = "", ylab = "Age",
  frame = FALSE,
  at = c(1, 3)
)
box(bty = "l")
fn <- tapply(df$Age, df$Class, fivenum)
tex <- data.frame(
  Class = c("Class1", "Class2"),
  max = c(fn$Class1[5], fn$Class2[5]),
  min = c(fn$Class1[1], fn$Class2[1]),
  median = c(fn$Class1[3], fn$Class2[3])
)
text(x = c(1, 3), y = tex$max + 2.5, paste(tex$max, "(max)", sep = ""))
text(x = c(1, 3), y = tex$min - 2.5, paste(tex$min, "(min)", sep = ""))
text(x = c(1.9, 3.9), y = tex$median, paste(tex$median, "(median)", sep = ""))
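A possible shortcut, sketched here under one assumption: boxplot also returns its statistics invisibly, and the stats matrix has one column per group whose rows 1, 3, and 5 are the lower whisker end, the median, and the upper whisker end. The whisker ends equal the min and max only when a group has no outliers, which happens to be the case for this data. The label offsets and xlim are illustrative.

```r
class1 <- data.frame(Age = c(23, 41, 32, 58, 26), Class = "Class1")
class2 <- data.frame(Age = c(33, 21, 56, 32, 65, 32, 89), Class = "Class2")
df <- rbind(class1, class2)

bp <- boxplot(Age ~ Class, data = df, ylim = c(0, 100), xlim = c(0.5, 3.2),
              xlab = "", ylab = "Age")

# Rows 1, 3, 5: lower whisker end, median, upper whisker end (one column per group)
stats <- bp$stats[c(1, 3, 5), , drop = FALSE]
what  <- c("min", "median", "max")

for (j in seq_len(ncol(stats))) {
  # pos = 4 writes the label to the right of the given x position
  text(x = j + 0.4, y = stats[, j],
       labels = paste0(stats[, j], " (", what, ")"), pos = 4)
}
```

If outliers are possible, fivenum (as in the answer above) is the safer source for the true min and max.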
Related
Using a dataframe with missing values:
structure(list(id = c("id1", "test", "rew", "ewt"), total_frq_1 = c(54, 87, 10, 36), total_frq_2 = c(45, 24, 202, 43), total_frq_3 = c(24, NA, 25, 8), total_frq_4 = c(36, NA, 104, NA)), row.names = c(NA, 4L), class = "data.frame")
How is it possible to create a bar plot of the mean of every column except id, without replacing the missing values with 0? Rows with a missing value should simply be left out of that column's mean. For example, for total_frq_3: (24 + 25 + 8) / 3 = 57 / 3 = 19.
You can use the colMeans function with na.rm = TRUE so that NA values are ignored.
library(ggplot2)
xy <- structure(list(id = c("id1", "test", "rew", "ewt"),
total_frq_1 = c(54, 87, 10, 36), total_frq_2 = c(45, 24, 202, 43), total_frq_3 = c(24, NA, 25, 8),
total_frq_4 = c(36, NA, 104, NA)),
row.names = c(NA, 4L),
class = "data.frame")
xy.means <- colMeans(x = xy[, 2:ncol(xy)], na.rm = TRUE)
xy.means <- as.data.frame(xy.means)
xy.means$total <- rownames(xy.means)
ggplot(xy.means, aes(x = total, y = xy.means)) +
  theme_bw() +
  geom_col()
Or just use base graphics:
barplot(height = colMeans(x = xy[, 2:ncol(xy)], na.rm = TRUE))
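To check that the NA rows are dropped per column rather than counted as 0, here is a small sketch that also reports how many values went into each mean. Selecting the columns by name with setdiff (instead of by position) and the variable names num_cols, col_means, and n_used are illustrative choices.

```r
xy <- data.frame(
  id = c("id1", "test", "rew", "ewt"),
  total_frq_1 = c(54, 87, 10, 36),
  total_frq_2 = c(45, 24, 202, 43),
  total_frq_3 = c(24, NA, 25, 8),
  total_frq_4 = c(36, NA, 104, NA)
)

num_cols  <- setdiff(names(xy), "id")            # every column except id
col_means <- colMeans(xy[num_cols], na.rm = TRUE)
n_used    <- colSums(!is.na(xy[num_cols]))       # non-missing values behind each mean

print(rbind(mean = col_means, n = n_used))
barplot(col_means, ylab = "mean")
```

For total_frq_3 the mean is taken over three values, which matches the 57 / 3 = 19 from the question.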
How to make wordcloud plot based on two columns values?
I have a dataframe as follows:
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Vikram", "Ramesh", "Luther")
Age <- c(23, 41, 32, 58, 26, 41, 32, 58)
Pval <- c(0.01, 0.06, 0.001, 0.002, 0.025, 0.05, 0.01, 0.0002)
df <- data.frame(Name, Age, Pval)
I want to make a word cloud of df$Name based on the values in df$Age and df$Pval. I used the following code:
library("tm")
library("SnowballC")
library("wordcloud")
library("wordcloud2")
library("RColorBrewer")
set.seed(1234)
wordcloud(words = df$Name, freq = df$Age, min.freq = 1,
max.words=10, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Here Luther and Ben are the same size, but I need Luther to be slightly bigger than Ben because he has the lower Pval.
A quick fix workaround:
library("dplyr")
library("scales")
library("wordcloud")
library("RColorBrewer")
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Vikram", "Ramesh", "Luther")
Age <- c(23, 41, 32, 58, 26, 41, 32, 58)
Pval <- c(0.01, 0.06, 0.001, 0.002, 0.025, 0.05, 0.01, 0.0002)
df <- data.frame(Name, Age, Pval)
df <- df %>%
  group_by(Age) %>%
  mutate(rank = rank(Pval)) %>%  # rank the p-values within each age group
  # rescale to [0, 1] so the tie-breaker never adds more than 1 to Age
  mutate(weight = scales::rescale(rank / max(rank), to = c(0, 1))) %>%
  mutate(weight = Age + (1 - weight)) %>%  # inverted: smaller p-value, larger weight
  ungroup()
# The result: a name with a unique age gets Age + 0.5; when an age is shared,
# the smallest p-value gets Age + 1 and the largest gets Age + 0. This should
# also work when more than two people share the same age.
set.seed(1234)
wordcloud(words = df$Name, freq = df$weight, min.freq = 1,
max.words=10, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Fun and interesting question btw
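For readers who want to inspect the weights without dplyr or scales, the same tie-breaking idea can be sketched in base R with ave. The helper name tie_break is hypothetical, and the fallback of 0.5 for a group with no spread in its ranks is written out by hand here to mirror the behaviour described in the comments above.

```r
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Vikram", "Ramesh", "Luther")
Age  <- c(23, 41, 32, 58, 26, 41, 32, 58)
Pval <- c(0.01, 0.06, 0.001, 0.002, 0.025, 0.05, 0.01, 0.0002)
df <- data.frame(Name, Age, Pval)

# Hypothetical helper: bonus in [0, 1] within one age group,
# 1 for the smallest p-value, 0 for the largest, 0.5 if there is nothing to compare
tie_break <- function(p) {
  r <- rank(p)
  if (diff(range(r)) == 0) return(rep(0.5, length(p)))
  1 - (r - min(r)) / (max(r) - min(r))
}

# ave() applies tie_break separately within each Age group
df$weight <- df$Age + ave(df$Pval, df$Age, FUN = tie_break)
print(df[order(-df$weight), ])
```

The weight column can then be passed as freq to wordcloud exactly as in the answer.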
I am trying to subset a data.table within a function, but subsetting by using !is.na(x) is not working. I know it could work, because as I was building my example on a still simpler problem, the subset call worked fine.
library(data.table)
library(ggpubr)
# data.table() keeps the numeric columns numeric; cbind() would coerce every column to character
tj = data.table(Name = c("Tom", "Tom", "Tim", "Jerry", NA, "Jerry", "Tim", NA),
                var1 = c(12, 12, 20, 30, 31, 21, 21, 31),
                var2 = c(12, 11, 27, 32, 31, 11, 21, 41),
                var3 = c(10, 10, 11, 13, 12, 12, 11, 10),
                time = c(1, 2, 1, 1, 1, 2, 2, 2))
plot.tj <- function(dat = tj, color = NULL) {
  name <- names(dat)[2:4] # character vector of column names to loop over
  for (i in seq_along(name)) {
    plotms <- ggline(dat[!is.na(color),], x = "time", y = name[i], color = color)
    print(plotms)
  }
}
plot.tj(color = "Name")
The expected output are the 3 var graphs, but without the NA group.
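The problem is that color holds a character string ("Name"), not the column itself, so !is.na(color) tests the string, which is never NA, and every row is kept. Use get to look the column up by name when subsetting the data.table. This works: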
plot.tj <- function(dat = tj, color = NULL) {
  name <- names(dat)[2:4] # character vector of column names to loop over
  for (i in seq_along(name)) {
    plotms <- ggline(dat[!is.na(get(color)),], x = "time", y = name[i], color = color)
    print(plotms)
  }
}
plot.tj(color = "Name")
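The difference is easier to see without the plotting. The sketch below uses a smaller made-up table and a hypothetical helper, count_rows, that counts the rows surviving each kind of filter. It also shows dat[[color]] as an alternative to get.

```r
library(data.table)

# Small illustrative table, not the data from the question
dat <- data.table(Name = c("Tom", NA, "Tim", NA), var1 = c(12, 31, 20, 31))

# Hypothetical helper: number of rows kept by each way of filtering
count_rows <- function(dat, color) {
  c(
    by_string = nrow(dat[!is.na(color)]),        # tests the string itself
    by_get    = nrow(dat[!is.na(get(color))]),   # looks the column up by name
    by_index  = nrow(dat[!is.na(dat[[color]])])  # same filter without get()
  )
}

res <- count_rows(dat, "Name")
print(res)
```

Only the last two variants drop the rows whose Name is NA.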
Code:
library(plotly)
library(tidyverse)
df <- data.frame(protein = c("Chicken", "Beef", "Pork", "Fish",
"Chicken", "Beef", "Pork", "Fish"),
y1 = c(3, 24, 36, 49, 7, 15, 34, 49),
y2 = c(9, 28, 40, 47, 8, 20, 30, 40 ),
gender = c("Male", "Male", "Male", "Male",
"Female", "Female", "Female", "Female"))
df %>%
  plot_ly() %>%
  add_bars(y = ~y1, x = ~protein, name = 'y1.male') %>%
  add_bars(y = ~y2, x = ~protein, color = I("green"), name = "y2.male") %>%
  add_bars(y = ~y1, x = ~protein, color = I("black"), name = 'y1.female') %>%
  add_bars(y = ~y2, x = ~protein, color = I("red"), name = "y2.female")
My desired result is to create something similar to this:
However when you run the code, you'll see that it has stacked the "Male" and "Female" values in each bar. I would like "y1.male" to represent the "Male" data when y = y1, "y2.male" to represent the "Male" data when y = y2, "y1.female" to represent the "Female" data when y = y1, and "y2.female" to represent the "Female" data when y = y2, respectively. How can I go about doing this without having to use filter by "transforms" in r-plotly?
We can rearrange the data to be in long format and then plot it:
df %>%
  pivot_longer(cols = c(y1, y2)) %>%
  unite(gender_var, c(gender, name)) %>%
  plot_ly() %>%
  add_bars(x = ~protein, y = ~value,
           name = ~gender_var)
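To see what the reshaping step hands to plot_ly, here is a sketch of just the tidyr part, written as plain function calls instead of a pipe. The intermediate name long is an illustrative choice. The new column gets one value per gender and measure combination, so each combination becomes its own bar trace.

```r
library(tidyr)

df <- data.frame(
  protein = rep(c("Chicken", "Beef", "Pork", "Fish"), times = 2),
  y1 = c(3, 24, 36, 49, 7, 15, 34, 49),
  y2 = c(9, 28, 40, 47, 8, 20, 30, 40),
  gender = rep(c("Male", "Female"), each = 4)
)

# y1 and y2 are stacked into the default columns "name" and "value"
long <- pivot_longer(df, cols = c(y1, y2))
# gender and name are pasted together with the default separator "_"
long <- unite(long, gender_var, c(gender, name))

print(head(long, 4))
print(table(long$gender_var))
```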
Does anyone know if it is possible to calculate a weighted mean in R when values are missing, and when values are missing, the weights for the existing values are scaled upward proportionately?
To convey this clearly, I created a hypothetical scenario. This describes the root of the question, where the scalar needs to be adjusted for each row, depending on which values are missing.
Image: Weighted Mean Calculation
File: Weighted Mean Calculation in Excel
Using weighted.mean from the base stats package with the argument na.rm = TRUE should get you the result you need. Here is a tidyverse way this could be done:
library(tidyverse)
scores <- tribble(
~student, ~test1, ~test2, ~test3,
"Mark", 90, 91, 92,
"Mike", NA, 79, 98,
"Nick", 81, NA, 83)
weights <- tribble(
~test, ~weight,
"test1", 0.2,
"test2", 0.4,
"test3", 0.4)
scores %>%
gather(test, score, -student) %>%
left_join(weights, by = "test") %>%
group_by(student) %>%
summarise(result = weighted.mean(score, weight, na.rm = TRUE))
#> # A tibble: 3 x 2
#> student result
#> <chr> <dbl>
#> 1 Mark 91.20000
#> 2 Mike 88.50000
#> 3 Nick 82.33333
The best way to post an example dataset is to use dput(head(dat, 20)), where dat is the name of a dataset. Graphic images are a really bad choice for that.
DATA.
dat <-
structure(list(Test1 = c(90, NA, 81), Test2 = c(91, 79, NA),
Test3 = c(92, 98, 83)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
w <-
structure(list(Test1 = c(18, NA, 27), Test2 = c(36.4, 39.5, NA
), Test3 = c(36.8, 49, 55.3)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
CODE.
You can use the function weighted.mean from the base package stats together with sapply for this. Note that if your datasets of scores and weights are R objects of class matrix you will not need unlist.
sapply(seq_len(nrow(dat)), function(i){
weighted.mean(unlist(dat[i,]), unlist(w[i, ]), na.rm = TRUE)
})
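If the nominal weights are the same for every row, a single weight vector may be enough, since weighted.mean with na.rm = TRUE drops each missing score together with its weight and divides by the sum of the remaining weights. The sketch below assumes the weights 0.2, 0.4, and 0.4 from the tidyverse answer above; the names weights and result are illustrative.

```r
dat <- data.frame(
  Test1 = c(90, NA, 81),
  Test2 = c(91, 79, NA),
  Test3 = c(92, 98, 83),
  row.names = c("Mark", "Mike", "Nick")
)

# Assumed nominal weights, identical for every student
weights <- c(Test1 = 0.2, Test2 = 0.4, Test3 = 0.4)

# For each row: sum(score * weight) / sum(weight) over the non-missing scores,
# which scales the remaining weights up proportionately
result <- apply(dat, 1, weighted.mean, w = weights[names(dat)], na.rm = TRUE)
print(round(result, 2))
```

For Mike, for example, this works out to (0.4 * 79 + 0.4 * 98) / 0.8.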