So, I have this data I'm trying to format into a nice table. Currently, I've just been using the kable() command from the knitr package, but I am trying to learn how to make nice and pretty tables that look more professional. I have the following code:
library(gt)
library(tidyverse)
library(glue)
Player <- c("Russel Westbrook", "James Harden", "Kawhi Leonard", "Lebron James",
"Isaiah Thomas", "Stephen Curry", "Giannis Antetokounmpo", "John Wall",
"Anthony Davis", "Kevin Durant")
Overall_proportion <- c(0.845, 0.847, 0.880, 0.674, 0.909, # q-the ratio of clutch makes
0.898, 0.770, 0.801, 0.802, 0.875) # by clutch attempts
Clutch_makes <- c(64, 72, 55, 27, 75, # Y-values
24, 28, 66, 40, 13)
Clutch_attempts <- c(75, 95, 63, 39, 83, # Clutch_attempts -values
26, 41, 82, 54, 16)
NBA_stats <- as.data.frame(cbind(Player, Overall_proportion, Clutch_makes, Clutch_attempts))
# creating the various quartiles for the posterior distributions
q25 <- qbeta(0.250, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
q50 <- qbeta(0.500, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
q75 <- qbeta(0.750, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
q90 <- qbeta(0.900, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
q_low <- qbeta(0.025, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
q_high <- qbeta(0.975, Clutch_makes + 1, Clutch_attempts - Clutch_makes + 1)
Player_distribution_table <- cbind(q25, q50, q75, q90, q_low, q_high)
rownames(Player_distribution_table) <- Player
I'm just trying to turn this into a table where the row names are those of the players, and the column names are "25th percentile, 50th percentile" etc.
Thank you!
gt needs a data.frame or tibble object, but Player_distribution_table is a matrix (because you built it with cbind). Build a data frame instead and pass it to gt() with rownames_to_stub = TRUE so that the player names become the row labels.
Player_distribution_table <- data.frame(q25, q50, q75, q90, q_low, q_high)
rownames(Player_distribution_table) <- Player
gt::gt(Player_distribution_table, rownames_to_stub = TRUE)
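The question also asks for headings such as "25th percentile". One possible continuation is sketched below, assuming a reasonably recent gt version. The label wording, the "Player" stub heading and the three-decimal rounding are illustrative choices, not part of the original answer. It relabels the columns with cols_label() and rounds the values with fmt_number().

```r
library(gt)

# Data from the question
Player <- c("Russel Westbrook", "James Harden", "Kawhi Leonard", "Lebron James",
            "Isaiah Thomas", "Stephen Curry", "Giannis Antetokounmpo", "John Wall",
            "Anthony Davis", "Kevin Durant")
Clutch_makes    <- c(64, 72, 55, 27, 75, 24, 28, 66, 40, 13)
Clutch_attempts <- c(75, 95, 63, 39, 83, 26, 41, 82, 54, 16)

# Posterior quantiles from Beta(makes + 1, misses + 1), as in the question
probs <- c(q25 = 0.25, q50 = 0.5, q75 = 0.75, q90 = 0.9,
           q_low = 0.025, q_high = 0.975)
quantiles <- lapply(probs, qbeta,
                    shape1 = Clutch_makes + 1,
                    shape2 = Clutch_attempts - Clutch_makes + 1)

# A data frame (not a matrix) with the players as row names
tab <- as.data.frame(quantiles)
rownames(tab) <- Player

tbl <- gt(tab, rownames_to_stub = TRUE) |>
  tab_stubhead(label = "Player") |>
  cols_label(
    q25 = "25th percentile",
    q50 = "50th percentile",
    q75 = "75th percentile",
    q90 = "90th percentile",
    q_low = "2.5th percentile",
    q_high = "97.5th percentile"
  ) |>
  fmt_number(columns = c(q25, q50, q75, q90, q_low, q_high), decimals = 3)

tbl
```

Listing the columns explicitly in fmt_number() keeps the formatting away from the stub column that holds the player names.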
Here is sample data where ID is a categorical variable.
ID <- c(12, 34, 560, 45, 235)
W1 <- c(0, 5, 7, 6, 0)
W2 <- c(7, 8, 9, 5, 2)
W3 <- c(0, 0, 3, 5, 9)
df <- data.frame(ID, W1, W2, W3)
df$ID <- as.factor(df$ID)
I want to draw one bar plot per ID (five in this sample), using the frequency data for the three weeks W1:W3. In the actual dataset I have 30+ weeks and around 150 IDs, so I want to do this efficiently. Nothing fancy, but ggplot would be ideal as I need to manipulate some aesthetics.
How can I do this with a loop and save all the images in one file (PDF)?
Thanks for your help!
This sort of problem is usually a data reformatting problem. See reshaping data.frame from wide to long format. After reshaping the data, the plot can be faceted by ID, which avoids loops.
library(ggplot2)
ID <- c(12, 34, 560, 45, 235)
W1 <- c(0, 5, 7, 6, 0)
W2 <- c(7, 8, 9, 5, 2)
W3 <- c(0, 0, 3, 5, 9)
df <- data.frame(ID, W1, W2, W3)
df$ID <- as.factor(df$ID)
df[-1] <- lapply(df[-1], as.integer)
df |>
  tidyr::pivot_longer(-ID, names_to = "Week", values_to = "Frequency") |>
  ggplot(aes(Week, Frequency, fill = Week)) +
  geom_col() +
  scale_y_continuous(breaks = scales::pretty_breaks()) +
  facet_wrap(~ ID) +
  theme_bw(base_size = 16)
Created on 2022-09-30 with reprex v2.0.2
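With around 150 IDs a single faceted page may become hard to read, and the question asks for one PDF. A minimal sketch of that variant follows, assuming one page per ID is acceptable. The output path and the page size are hypothetical choices. It opens a single pdf() device, prints one plot per ID inside a loop, and closes the device.

```r
library(ggplot2)
library(tidyr)

df <- data.frame(
  ID = factor(c(12, 34, 560, 45, 235)),
  W1 = c(0, 5, 7, 6, 0),
  W2 = c(7, 8, 9, 5, 2),
  W3 = c(0, 0, 3, 5, 9)
)
long <- pivot_longer(df, -ID, names_to = "Week", values_to = "Frequency")

# Hypothetical output path; change it to wherever the PDF should go
out_file <- file.path(tempdir(), "plots_by_id.pdf")

pdf(out_file, width = 7, height = 5)  # one multi-page PDF device
for (id in levels(long$ID)) {
  p <- ggplot(long[long$ID == id, ], aes(Week, Frequency, fill = Week)) +
    geom_col() +
    labs(title = paste("ID", id)) +
    theme_bw()
  print(p)  # inside a loop, a ggplot has to be printed explicitly
}
dev.off()  # close the device so the file is written completely
```

Each print() call starts a new page, so the resulting file has as many pages as there are IDs.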
Edit
If the week numbers mix one and two digits, lexicographic order differs from numeric order: after W1 comes W11, not W2. The stringr function str_sort sorts numerically when called with numeric = TRUE.
In the example below I reuse the data, renaming W2 to W11. The correct bar order is therefore W1, W3, W11.
library(ggplot2)
library(stringr)
ID <- c(12, 34, 560, 45, 235)
W1 <- c(0, 5, 7, 6, 0)
W11 <- c(7, 8, 9, 5, 2)
W3 <- c(0, 0, 3, 5, 9)
df <- data.frame(ID, W1, W11, W3)
df$ID <- as.factor(df$ID)
df[-1] <- lapply(df[-1], as.integer)
df |>
  tidyr::pivot_longer(-ID, names_to = "Week", values_to = "Frequency") |>
  dplyr::mutate(Week = factor(Week, levels = str_sort(unique(Week), numeric = TRUE))) |>
  ggplot(aes(Week, Frequency, fill = Week)) +
  geom_col() +
  scale_y_continuous(breaks = scales::pretty_breaks()) +
  facet_wrap(~ ID) +
  theme_bw(base_size = 16)
Created on 2022-10-01 with reprex v2.0.2
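The ordering issue can be checked in isolation. The short sketch below compares the default sort with str_sort(numeric = TRUE) on the three week labels. The base R line at the end is an alternative of my own that assumes every label is "W" followed by digits.

```r
library(stringr)

weeks <- c("W1", "W3", "W11")

lexicographic <- sort(weeks)                      # compares character by character
numeric_aware <- str_sort(weeks, numeric = TRUE)  # treats digit runs as numbers

# Factor whose levels follow the numeric order, as used in the plot above
week_factor <- factor(weeks, levels = numeric_aware)

# Base R alternative (assumes labels of the form "W<number>")
base_order <- weeks[order(as.integer(sub("^W", "", weeks)))]
```

Here lexicographic puts W11 before W3, while numeric_aware and base_order both give W1, W3, W11.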
How to make wordcloud plot based on two columns values?
I have a dataframe as follows:
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Vikram", "Ramesh", "Luther")
Age <- c(23, 41, 32, 58, 26, 41, 32, 58)
Pval <- c(0.01, 0.06, 0.001, 0.002, 0.025, 0.05, 0.01, 0.0002)
df <- data.frame(Name, Age, Pval)
I want to make wordcloud plot for df$Name based on values in df$Age and df$Pval. I used following code:
library("tm")
library("SnowballC")
library("wordcloud")
library("wordcloud2")
library("RColorBrewer")
set.seed(1234)
wordcloud(words = df$Name, freq = df$Age, min.freq = 1,
max.words=10, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Here Luther and Ben are the same size, but I need Luther to be slightly bigger than Ben because he has the lower Pval.
A quick fix workaround:
library("dplyr")
library("scales")
library("wordcloud")
library("RColorBrewer")
Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Vikram", "Ramesh", "Luther")
Age <- c(23, 41, 32, 58, 26, 41, 32, 58)
Pval <- c(0.01, 0.06, 0.001, 0.002, 0.025, 0.05, 0.01, 0.0002)
df <- data.frame(Name, Age, Pval)
df <- df %>%
  group_by(Age) %>%
  mutate(rank = rank(Pval)) %>% # rank the p-values within each age
  mutate(weight = scales::rescale(rank/max(rank), to=c(0,1))) %>%
  # rescaling to [0, 1] makes sure we never add more than one to the age
  mutate(weight = Age + (1-weight) ) # 1 - weight because the rank is inverted
# The final step adds 0.5 if nobody else has the same age, and 1 if someone
# else does but you have the smaller p-value (it should also work when
# more than two people share the same age)
set.seed(1234)
wordcloud(words = df$Name, freq = df$weight, min.freq = 1,
max.words=10, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
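To see what the workaround feeds into wordcloud(), the weights can be printed before plotting. The sketch below re-runs the same steps on the question's data. The column name pval_rank and the final sort are my additions, and the wordcloud call itself is left out.

```r
library(dplyr)

Name <- c("Jon", "Bill", "Maria", "Ben", "Tina", "Vikram", "Ramesh", "Luther")
Age  <- c(23, 41, 32, 58, 26, 41, 32, 58)
Pval <- c(0.01, 0.06, 0.001, 0.002, 0.025, 0.05, 0.01, 0.0002)
df <- data.frame(Name, Age, Pval)

weights <- df %>%
  group_by(Age) %>%
  mutate(
    pval_rank = rank(Pval),  # 1 = smallest p-value within the age group
    # rescale() maps a one-person group to 0.5, the midpoint of [0, 1]
    weight = Age + (1 - scales::rescale(pval_rank / max(pval_rank), to = c(0, 1)))
  ) %>%
  ungroup() %>%
  arrange(desc(weight)) %>%
  select(Name, Age, Pval, weight)

print(weights)
```

Luther ends up with a weight of 59 and Ben with 58, so the names keep their age ordering while ties are broken by the p-value.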
Fun and interesting question btw
I have a dataframe, the head of which looks like this:
|trackName | week| sum|
|:--------------------|----:|---:|
|New Slang | 1| 493|
|You're Somebody Else | 1| 300|
|Mushaboom | 1| 297|
|San Luis | 1| 296|
I am interested in plotting a line graph for each of the 346 unique trackNames in the dataframe, with week on the x-axis and sum on the y-axis. To automate this process, I wrote the following function:
charts <- function(df) {
songs <- df
lim <- nrow(songs)
x <- 1
song_names <- as_tibble(unique(songs$trackName))
while (x <= lim) {
song <- song_names[x, 1]
plot.name <- paste(paste(song), "plot.png", sep = "_")
songs %>% filter(trackName == paste(song[x, 1])) %>%
ggplot(., aes(x = week, y = sum), group = 1) +
geom_line() +
labs(
x = "Week",
y = "Sum of Listens",
title = paste("Week by Week Listening Interest for", song, sep = " "),
subtitle = "Calculated by plotting the sum of percentages of the song listened per week, starting from first listen"
) +
ggsave(plot.name,
width = 20,
height = 15,
units = "cm")
x <- x + 1
}
}
However when I run charts(df), only the following error shows up and then it quits:
> charts(mini)
geom_path: Each group consists of only one observation. Do you need to
adjust the group aesthetic?
>
What am I doing wrong here and what does this error mean?
A sample of the dataframe in DPUT format:
structure(list(trackName = c("New Slang", "You're Somebody Else",
"Mushaboom", "San Luis", "The Trapeze Swinger", "Flightless Bird, American Mouth",
"tere bina - Acoustic", "Only for a Moment", "Upward Over the Mountain",
"Virginia May", "Never to Be Forgotten Kinda Year", "Little Talks",
"Jhak Maar Ke", "Big Rock Candy Mountain", "Sofia", "Aaoge Tum Kabhi",
"Deathcab", "Dil Mere", "Choke", "Phir Le Aya Dil", "Lucille",
"tere bina - Acoustic", "Dil Mere", "Only for a Moment", "This Is The Life",
"San Luis", "Main Bola Hey!", "Choo Lo", "Yeh Zindagi Hai", "Aaftaab",
"Never to Be Forgotten Kinda Year", "Khudi", "Flightless Bird, American Mouth",
"Mere Bina", "Simple Song", "Dil Haare", "Dil Hi Toh Hai", "You're Somebody Else",
"Sofia", "Who's Laughing Now", "Main Bola Hey!", "Lucille", "Eenie Meenie",
"tere bina - Acoustic", "New Slang", "Aaftaab", "Mamma Mia",
"July", "Yeh Zindagi Hai", "Someone You Loved"), week = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3), sum = c(493, 300, 297, 296, 292, 234, 214,
200, 200, 197, 192, 187, 185, 181, 175, 172, 141, 119, 106, 103,
579, 574, 501, 462, 428, 378, 320, 307, 306, 301, 301, 300, 300,
300, 300, 300, 296, 294, 251, 242, 3534, 724, 696, 512, 479,
400, 302, 300, 300, 300)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
How about using purrr::walk instead?
library(tidyverse)
library(hrbrthemes)
walk(unique(songs$trackName), ~ {
  # group = 1 belongs inside aes(); build the plot first, then save it
  p <- ggplot(filter(songs, trackName == .x), aes(x = week, y = sum, group = 1)) +
    geom_line(color = ft_cols$yellow) +
    labs(x = "Week", y = "Sum of Listens",
         title = paste("Week by Week Listening Interest for", .x, sep = " "),
         subtitle = "Calculated by plotting the sum of percentages of the song listened per week, starting from first listen") +
    theme_ft_rc()
  ggsave(filename = paste0(.x, "_plot.png"), plot = p, width = 20, height = 15, units = "cm")
})
Note: the question was subsequently edited to remove the hrbrthemes package requirement.
You can split the dataset for each trackName and create a png file for it.
library(tidyverse)
charts <- function(df) {
  df %>%
    group_split(trackName) %>%
    map(~{
      track <- first(.x$trackName)
      ggplot(.x, aes(x = factor(week), y = sum, group = 1)) +
        geom_line() +
        labs(
          x = "Week",
          y = "Sum of Listens",
          title = paste("Week by Week Listening Interest for", track),
          subtitle = "Calculated by plotting the sum of percentages of the song listened per week, starting from first listen"
        ) -> plt
      ggsave(paste0(track,'.png'), plt, width = 20, height = 15, units = "cm")
    })
}
charts(songs)
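On the meaning of the message: it is informational rather than an error. ggplot2 prints it when geom_line() receives groups that each contain a single row, so there is nothing to connect. In the sample data several tracks appear in only one week (for example "Mushaboom"), which would produce exactly this message. The sketch below uses a few rows taken from the sample and a hypothetical helper plot_track(). It lists such tracks and adds geom_point() so that a one-week track still shows something.

```r
library(ggplot2)

# A few rows taken from the sample data in the question
songs <- data.frame(
  trackName = c("New Slang", "New Slang", "San Luis", "San Luis", "Mushaboom"),
  week = c(1, 3, 1, 2, 1),
  sum = c(493, 479, 296, 378, 297)
)

# Tracks with fewer than two rows cannot form a line
rows_per_track <- table(songs$trackName)
single_week <- names(rows_per_track)[rows_per_track < 2]
single_week

# Hypothetical helper: a line plus points, so single observations stay visible
plot_track <- function(track) {
  ggplot(songs[songs$trackName == track, ], aes(x = week, y = sum)) +
    geom_line() +
    geom_point() +
    labs(x = "Week", y = "Sum of Listens",
         title = paste("Week by Week Listening Interest for", track))
}

p <- plot_track("Mushaboom")
```

Printing p still emits the message for a one-week track, but the point is drawn. Filtering out the names in single_week before plotting would avoid the message altogether.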
I am learning R and want to manually compute the Mann-Whitney U statistic and p-value using a normal approximation (and not use wilcox.test or equivalent). My pensioner's brain struggles with coding so it has taken me hours to produce the same answers as the textbook. However, my code to sum the 'StateRank' for the state with the fewest values is convoluted. How can I replace the commented section with more efficient code? I've hunted high and low, both here and on Google, but I don't even know which search terms to use! It won't surprise me to hear that there is a one-line solution but I'm no nearer knowing what it is.
library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
pivot_longer(
cols = c("Alaska", "Calif"),
names_to = "State",
values_to = "Value",
values_drop_na = TRUE
) %>%
mutate(StateRank = rank(Value, ties.method = "average"))
# clumsy code to sort, then sum ranks (StateRank) for group with fewest values (nA)
#--------------------------------------------------------------------------------
asc_or_desc <- as.matrix(count(a.df, State))
if (as.numeric(asc_or_desc[1,2])>as.numeric(asc_or_desc[2,2])) {
a.df <- arrange(a.df, desc(State))
} else {
a.df <- arrange(a.df, State)
}
#--------------------------------------------------------------------------------
nA <- as.numeric(min(count(a.df, State, sort = TRUE)$n))
nB <- as.numeric(max(count(a.df, State, sort = TRUE)$n))
a.U <- sum(a.df$StateRank[1:nA])
a.E <- (nA*(nA+nB+1))/2 # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12 # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2),
mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that
# the answer is the same as in the textbook that I use)
Please try this code and tell me whether I am on the right track.
I replaced your so-called clumsy code with this:
... %>%
group_by(State) %>%
mutate(mx = max(Value)) %>%
arrange(desc(mx), desc(Value)) %>%
select(-mx)
The whole code:
library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
pivot_longer(
cols = c("Alaska", "Calif"),
names_to = "State",
values_to = "Value",
values_drop_na = TRUE
) %>%
mutate(StateRank = rank(Value, ties.method = "average")) %>%
group_by(State) %>%
mutate(mx = max(Value)) %>%
arrange(desc(mx), desc(Value)) %>%
select(-mx)
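#-----------------------------------------------------------------------------
# nA and nB are still needed below, so they are computed as in the question
nA <- as.numeric(min(count(a.df, State, sort = TRUE)$n))
nB <- as.numeric(max(count(a.df, State, sort = TRUE)$n))
a.U <- sum(a.df$StateRank[1:nA])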
a.E <- (nA*(nA+nB+1))/2 # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12 # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2),
mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that
# the answer is the same as in the textbook that I use)
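Sorting by each state's maximum value puts Alaska first only because Alaska happens to contain the largest value (557), so it would not pick the smaller group in general. An alternative sketch, which is a different technique from the answer's, assumes exactly two groups. It summarises the count and rank sum per state and takes the row with the fewest values, so no row ordering is needed. The name by_state is illustrative.

```r
library(dplyr)
library(tidyr)

a.df <- data.frame(
  Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
  Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
  pivot_longer(
    cols = c("Alaska", "Calif"),
    names_to = "State",
    values_to = "Value",
    values_drop_na = TRUE
  ) %>%
  mutate(StateRank = rank(Value, ties.method = "average"))

# One row per state, with the smaller group first
by_state <- a.df %>%
  group_by(State) %>%
  summarise(n = n(), rank_sum = sum(StateRank)) %>%
  arrange(n)

nA  <- by_state$n[1]         # size of the smaller group
nB  <- by_state$n[2]         # size of the larger group
a.U <- by_state$rank_sum[1]  # rank sum of the smaller group

a.E <- (nA * (nA + nB + 1)) / 2        # expectation
a.V <- (nA * nB * (nA + nB + 1)) / 12  # variance
a.Z <- (a.U - a.E) / sqrt(a.V)
```

For this data the smaller group is Alaska, with nA = 12 and a rank sum of 115. The rounding steps for a.P from the question can be applied to a.Z unchanged.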
I have the following code for creating a boxplot in ggplot2:
throughput <- c(1, 2, 3, 4, 5)
response_time_min <- c(9, 19, 29, 39, 49)
response_time_10 <- c(50, 55, 60, 60, 61)
response_time_med <- c(100, 100, 100, 100, 100)
response_time_90 <- c(201, 201, 250, 200, 230)
response_time_max <- c(401, 414, 309, 402, 311)
df <- data.frame(throughput, response_time_min, response_time_10, response_time_med,response_time_90, response_time_max)
df
library(ggplot2)
g <- ggplot(df) +
geom_boxplot(aes(x=factor(throughput),ymax = response_time_max,upper = response_time_90,
y = response_time_med,
middle = response_time_med,
lower = response_time_10,
ymin = response_time_min), stat = "identity")
g
But now when I want to apply ggplotly(g) the graph does not render correctly. What can I do to make this work?
I don't think a plotly box trace can take the 10th and 90th percentiles as such. If you treat them as q1 and q3, respectively, the code below
library(plotly)
bp <- plot_ly(color=c("orange")) %>%
  add_trace(lowerfence = response_time_min, q1 = response_time_10,
            median = response_time_med, q3 = response_time_90,
            upperfence = response_time_max, type = "box") %>%
  layout(xaxis=list(title="throughput"),
         yaxis=list(title="response_time"))
bp
gives the following output: