I am working on the most recent TidyTuesday data and ran into an issue with my plot. New Jersey is shown above Nashville even though Nashville has more tweets overall. I am unsure how to fix this.
I think it has something to do with the one tweet by user etmckinley being sorted in Nashville first since it alphabetically comes before username sqlsekou. Perhaps there is a way to reverse the sorting and have it work correctly?
If not, how else can I order the data correctly to have Nashville above New Jersey?
library(tidyverse)
tweets <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-15/tweets.csv')
top_states <- tweets %>%
filter(
location != "iPhone: 34.704040,-86.722909",
location != "Kevin.Elder#GCSU.edu"
) %>%
drop_na(location) %>%
count(location, sort = TRUE) %>%
slice_max(n, n = 7) %>%
pull(location)
tweets %>%
filter(
location != "iPhone: 34.704040,-86.722909",
location != "Kevin.Elder#GCSU.edu"
) %>%
drop_na(location) %>%
count(location, username, sort = TRUE) %>%
filter(location %in% top_states) %>%
mutate(location = fct_reorder(location, n)) %>%
mutate(username = fct_reorder(username, -n)) %>%
ggplot(aes(n, location, fill = username)) +
geom_col() +
scale_fill_brewer(palette = "Set3") +
labs(
x = "Quantity of tweets",
y = "Location",
title = "Tweets by location over 3 month period",
subtitle = "Filled by username"
)
By default, fct_reorder reorders by the median value. Your Nashville bar has two components, one big and one small, and the median is halfway in between. Your NJ bar has only one component, so the median is the full value. Override the default in fct_reorder by setting .fun = sum. See ?fct_reorder for more details.
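To see the difference in isolation, here is a minimal sketch with made-up counts (the numbers are hypothetical, not taken from the TidyTuesday data): Nashville has two users and New Jersey has one.

```r
library(forcats)

# Hypothetical counts: two users in Nashville, one in New Jersey
location <- c("Nashville", "Nashville", "New Jersey")
n <- c(10, 1, 8)

# Default (.fun = median): Nashville's median is 5.5, New Jersey's is 8,
# so New Jersey becomes the last level (drawn at the top of a y axis)
by_median <- fct_reorder(location, n)
levels(by_median)

# .fun = sum: Nashville's total is 11, New Jersey's is 8,
# so Nashville becomes the last level
by_sum <- fct_reorder(location, n, .fun = sum)
levels(by_sum)
```

In the original pipeline this would amount to mutate(location = fct_reorder(location, n, .fun = sum)).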
Suppose I have the following function in R:
my_function <- function(date1, date2, variable, quota, monthly_business_days) {
my_data %>%
filter(between(DATE, ymd(date1), ymd(date2))) %>%
summarize(total = sum({{variable}})) %>%
add_row(total = quota, .before = 1) %>%
rbind(.$total[[2]]/bizdays(date1, date2)*monthly_business_days)
if(variable == UNITS) {
%>% mutate(indicator = c("Quota (Units)", "Sales (Units)", "Forecast (Units)"))
} else {
%>% mutate(indicator = c("Quota (USD)", "Sales (USD)", "Forecast (USD)"))
}
}
Depending on the value passed as "variable", which refers to a column in the "my_data" data frame, I would like to proceed in different ways (see the if/else chunk). However, this does not work inside a dplyr pipeline as written (the only alternative I can see is writing each case out separately, and that would take far too many lines of code). I'm currently working on a Shiny application and trying to reduce the amount of code through functional programming.
Different posts here on StackOverflow have not given me the answer so far.
I would really appreciate any help. Thanks!
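Try the function below. It captures the name of the column passed as variable with deparse(substitute(variable)), so the name can be compared as a string inside mutate():
library(dplyr)
library(lubridate) # for ymd()
library(bizdays)   # for bizdays()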
my_function <- function(date1, date2, variable, quota, monthly_business_days) {
value <- deparse(substitute(variable))
my_data %>%
filter(between(DATE, ymd(date1), ymd(date2))) %>%
summarize(total = sum({{variable}})) %>%
add_row(total = quota, .before = 1) %>%
rbind(.$total[[2]]/bizdays(date1, date2)*monthly_business_days) %>%
mutate(indicator = if(value == 'UNITS') c("Quota (Units)", "Sales (Units)", "Forecast (Units)")
else c("Quota (USD)", "Sales (USD)", "Forecast (USD)"))
}
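The key step is deparse(substitute(variable)), which turns the unevaluated argument into a string. A minimal base R sketch of just that step follows; label_for and the REVENUE column name are hypothetical and used only for illustration.

```r
# substitute() captures the expression the caller typed without evaluating it,
# and deparse() converts that expression into a character string
label_for <- function(variable) {
  value <- deparse(substitute(variable))
  if (value == "UNITS") "Sales (Units)" else "Sales (USD)"
}

# Neither UNITS nor REVENUE needs to exist as an object: only the names are used
units_label <- label_for(UNITS)
usd_label <- label_for(REVENUE)
```

Because the argument is never evaluated, this only works when the caller passes a bare column name, not a string or a computed expression.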
I am required to re-create the visualization but filter the data to keep only the businesses that had terms less than 360 months.
The data I am using is the SBA data from this link:
https://amstat.tandfonline.com/doi/full/10.1080/10691898.2018.1434342
library('magrittr')
library(dplyr)
library(tidyr)
library(ggplot2)
sba2 <- sba_data %>%
mutate(
  # CHGOFF means the loan was charged off, i.e. it defaulted
  default_binary = ifelse(MIS_Status == "CHGOFF", "Default", "Paid in Full"),
  daysterm = Term * 30,
  xx = as.Date(DisbursementDate, format = "%Y-%m-%d") + daysterm,
  recession_binary = ifelse(xx >= as.Date("2007-12-01") & xx <= as.Date("2009-06-30"),
                            "Active during Recession", "Not Active during Recession"),
  smaller_business_binary = ifelse(NoEmp < 30, "Very Small Biz", "Not Very Small Biz"),
  business_length = ifelse(Term < 360, "Short Business", "Long Business")
)
table(sba2$business_length)
sba_3 <- sba2 %>%
group_by(recession_binary, default_binary) %>%
summarise(frequencies=n()) %>%
drop_na() %>%
mutate(percents = round(frequencies/sum(frequencies),2))
ggplot(data = sba_3 ) +
geom_col(mapping = aes(x = recession_binary, y = percents, fill = default_binary)) +
coord_flip() +
scale_fill_manual(breaks = c("Default", "Paid in Full"),
values=c(rgb(232/255,74/255,39/255), rgb(19/255,41/255,75/255))) +
scale_y_continuous(labels = scales::percent)
This is my code so far to recreate the visualization. However, I am unsure how to filter the data to keep only businesses with a term of less than 360 months. I created the variable business_length when mutating sba2, but I am not sure what the next steps are. Any help would be greatly appreciated, thanks!
Something like this? Add a filter() on Term before grouping:
sba_3 <- sba2 %>%
filter(Term < 360) %>%
group_by(recession_binary, default_binary) %>%
summarise(frequencies=n()) %>%
drop_na() %>%
mutate(percents = round(frequencies/sum(frequencies),2))
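A small self-contained sketch of the same pattern on made-up rows (the toy data frame is hypothetical and only mimics the relevant columns). Filtering on business_length == "Short Business" should give the same rows, assuming that column was built from Term < 360 as in the question.

```r
library(dplyr)

# Hypothetical stand-in for sba2 with only the columns used here
toy <- data.frame(
  Term = c(60, 120, 360, 84, 400, 36),
  recession_binary = c("Active", "Active", "Active",
                       "Not Active", "Not Active", "Not Active"),
  default_binary = c("Default", "Paid in Full", "Default",
                     "Paid in Full", "Paid in Full", "Default")
)

toy_summary <- toy %>%
  filter(Term < 360) %>%                        # drops the 360 and 400 month rows
  group_by(recession_binary, default_binary) %>%
  summarise(frequencies = n(), .groups = "drop_last") %>%
  # still grouped by recession_binary, so shares are computed within each group
  mutate(percents = round(frequencies / sum(frequencies), 2))

toy_summary
```

Note that the comparison is strict, so a term of exactly 360 months is excluded.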
The gt package lets users easily format cells based on conditional statements about the rows. I'm looking for a way to format each cell based on the value in the cell.
Here's what I mean. In the table below, I'd like to color each cell with S&P values by the value it contains.
library(gt)
library(dplyr)
library(tidyr)
# some arbitrary values of the S&P 500
jan08 <- sp500 %>%
filter(between(date, as.Date("2008-01-01"), as.Date("2008-01-15"))) %>%
select(date, open, high, low, close)
gt(jan08)
This function returns the appropriate color name for each value as a character string.
## this is the range of values
sp500.range <- jan08 %>% pivot_longer(cols = c(open, high, low, close))
heat_palette <- leaflet::colorNumeric(palette = "YlOrRd",
domain = sp500.range$value)
# For example:
> heat_palette(1411.88)
[1] "#FEB852"
Each cell can be colored manually, but this obviously isn't practical.
gt(jan08) %>%
tab_style(style = cell_fill(color = heat_palette(1411.88)),
locations = cells_body(columns = "open",
rows = (open == 1411.88)))
Is there a way to use the tab_style function to conditionally fill cells based on the value of the cell?
Create the gt object first, then loop over the rows and style one cell per iteration, because the color argument of cell_fill() takes a single value (length 1):
library(gt)
gtobj <- gt(jan08)
ht_values <- heat_palette(jan08$open)
for(i in seq_along(jan08$open)) {
gtobj <- gtobj %>%
tab_style(style = cell_fill(color = ht_values[i]),
locations = cells_body(columns = "open", rows = i))
}
gtobj
EDIT:
This for loop can then be placed in a function like this.
fill_column <- function(gtobj, column){
ht_values <- heat_palette(jan08 %>% pull(sym(column)))
for(i in seq_along(jan08 %>% pull(sym(column)))){
gtobj <- gtobj %>%
tab_style(style = cell_fill(color = ht_values[i]),
locations = cells_body(columns = column, rows = i))
}
gtobj
}
Then, this function can be included with a pipe.
gt(jan08) %>%
fill_column("open") %>%
fill_column("high") %>%
fill_column("low") %>%
fill_column("close")
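As a possible alternative to the loop, gt also provides data_color(), which maps cell values to fill colors in a single call. The sketch below assumes gt 0.9.0 or later, where data_color() accepts a color-mapping function through fn (older versions took it through a colors argument instead), and it swaps leaflet::colorNumeric for the equivalent scales::col_numeric. The names price_cols, shared_domain and heat_tbl are made up for this example.

```r
library(gt)
library(dplyr)

# Same slice of the sp500 dataset (shipped with gt) as in the question
jan08 <- sp500 %>%
  filter(between(date, as.Date("2008-01-01"), as.Date("2008-01-15"))) %>%
  select(date, open, high, low, close)

price_cols <- c("open", "high", "low", "close")

# One shared domain so the same value gets the same color in every column
shared_domain <- range(unlist(jan08[price_cols]))

heat_tbl <- gt(jan08) %>%
  data_color(
    columns = all_of(price_cols),
    fn = scales::col_numeric(palette = "YlOrRd", domain = shared_domain)
  )

heat_tbl
```

Fixing the domain up front matters because each column is colored separately. Without it, every column would be rescaled to its own range.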
I have the following 3 way table I created in R.
with(dataset, ftable(xtabs(count ~ dos + sex + edu)))
The output looks like
                           edu     high     low  medium unknown
dos                 sex
five-to-ten-years   female       247776   44916  127133   23793
                    male         225403   37858  147821   20383
five-years-or-less  female       304851   58018  182152   33649
                    male         253977   55720  193621   28972
more-than-ten-years female       709303  452605  539403  165675
                    male         629162  309193  689299  121336
native-born         female      1988476 1456792 2094297  502153
                    male        1411509 1197395 2790522  395953
unknown             female        57974   75480   73204  593141
                    male          40176   57786   93108  605542
I want to rename the variables and format the table so that I can include it in a report. I know that I can use dnn to rename the variables, but are there any other recommendations to rename the variables? And to format the table (similar to using kable)?
You could convert the output to a text matrix using the following function, after which you can style with kable however you choose:
ftab_to_matrix <- function(ft)
{
row_vars <- attr(ft, "row.vars")
for(i in seq_along(row_vars)){
row_vars[[i]] <- c(names(row_vars[i]), row_vars[[i]])}
rowvar_widths <- sapply(row_vars, function(x) max(nchar(x))) + 1
col_vars <- attr(ft, "col.vars")
rowvar_widths <- c(1, cumsum(c(rowvar_widths, max(nchar(names(col_vars))))))
ft_text <- capture.output(print(ft))
row_cols <- sapply(seq_along(rowvar_widths)[-1], function(x)
substr(ft_text, rowvar_widths[x - 1], rowvar_widths[x]))
ft_text <- substr(ft_text, rowvar_widths[length(rowvar_widths)] + 2, 100)
ft_breaks <- c(1, cumsum(lapply(strsplit(ft_text[length(ft_text)], "\\d "),
function(x) nchar(x) + 2)[[1]]))
col_cols <- sapply(seq_along(ft_breaks)[-1], function(x)
substr(ft_text, ft_breaks[x - 1], ft_breaks[x]))
trimws(cbind(row_cols, col_cols))
}
So, for example, using my example data from your last question, you could do something like:
library(knitr)      # for kable()
library(kableExtra) # for kable_styling() and as_image()
my_tab <- with(`3waydata`, ftable(xtabs(count ~ duration + sex + education)))
as_image(kable_styling(kable(ftab_to_matrix(my_tab))), file = "kable.png")
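A shorter route, offered as a sketch rather than a drop-in replacement: base R's format() method for ftable objects returns a character matrix directly, and the variables can be renamed by changing the dimnames names of the xtabs table before calling ftable(). The toy data frame and the new labels ("Duration", "Sex", "Education") are made up for illustration.

```r
# Hypothetical counts with the same three variables as in the question
toy <- data.frame(
  dos   = rep(c("five-years-or-less", "native-born"), each = 4),
  sex   = rep(rep(c("female", "male"), each = 2), times = 2),
  edu   = rep(c("high", "low"), times = 4),
  count = 1:8
)

tab <- xtabs(count ~ dos + sex + edu, data = toy)

# Rename the variables on the table itself; ftable() picks the new names up
names(dimnames(tab)) <- c("Duration", "Sex", "Education")
ft <- ftable(tab)

# format() on an ftable gives a character matrix, headers included
ft_matrix <- format(ft, quote = FALSE)
ft_matrix
```

The resulting matrix can then be passed to kable() and styled in the same way as the output of the function above.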
Might have been easier had you given the full picture when you asked your first question... You could use gt to make fancy tables for reports. This is an edited version more fully demonstrating some capabilities.
library(dplyr)
library(gt)
way3data <- data %>%
group_by(duration, education, sex) %>%
summarise(count = sum(number)) %>%
ungroup()
# Reorder with select and Titlecase with stringr
longer <- tidyr::pivot_wider(way3data,
values_from = count,
names_from = "education") %>%
select(duration, sex, high, medium, low, unknown) %>%
rename_with(stringr::str_to_title)
# Demonstrating some of the features of gt
# obviously could have done some of this
# to the original dataframe
myresults <- longer %>%
group_by(Duration) %>%
gt(rowname_col = "Sex") %>%
row_group_order(
groups = c("native-born",
"more-than-ten-years",
"five-to-ten-years",
"five-years-or-less",
"unknown")
) %>%
tab_spanner(label = "Education",
columns = matches("High|Low|Medium|Unknown")) %>%
tab_stubhead(label = "Duration or something") %>%
tab_style(
style = cell_text(style = "oblique", weight = "bold"),
locations = cells_row_groups()) %>%
tab_style(
style = cell_text(align = "right", style = "italic", weight = "bold"),
locations = cells_column_labels(
columns = c(High, Low, Medium, Unknown) # vars() is deprecated in newer gt versions
)) %>%
tab_style(
style = cell_text(align = "right", weight = "bold"),
locations = cells_stub()) %>%
tab_header(
title = "Fancy table of counts with Duration, Education and Gender") %>%
tab_source_note(md("More information is available at https://stackoverflow.com/questions/62284264."))
# myresults
# Can save in other formats including .rtf
myresults %>%
gtsave(
"tab_1.png", expand = 10
)
You can read about all the formatting choices in the gt documentation.
Data compliments of Allan
set.seed(69)
data <- data.frame(education = sample(c("high","low","medium","unknown"), 600, T),
sex = rep(c("Male", "Female"), 300),
duration = sample(c("unknown", "native-born",
"five-years-or-less", "five-to-ten-years",
"more-than-ten-years"), 600, T),
number = rpois(600, 10))
For an assignment I have to draw a boxplot(), but the boxes somehow come out squeezed together. I'm new to R :(
I guess the problem is that the x-axis labels are too long and are not placed vertically, so I tried to rotate them (based on the question "Inserting labels in box plot in R on a 45 degree angle?") and failed.
examples <- read.csv("mov.development.csv", sep="\t")
library(dplyr)
movies_rated_67_times <- examples %>%
group_by(movie) %>%
summarize(count=n(), avg_rating=mean(rating))%>%
filter(count == 67)
boxplot_data <- examples %>%
filter(movie %in% movies_rated_67_times$movie) %>%
select(title, rating)
boxplot(rating~title,
data=boxplot_data,
xlab="Title",
ylab="Rating",
xaxt = "n"
)
text(seq_along(boxplot_data$title), par("usr")[3] - 0.5, labels = names(boxplot_data$title), srt = 90, adj = 1, xpd = TRUE);
I want to have a plot like this
But I got this
With shorter labels (the movie column instead of title), the normal code works:
examples <- read.csv("mov.development.csv", sep="\t")
library(dplyr)
movies_rated_67_times <- examples %>%
group_by(movie) %>%
summarize(count=n(), avg_rating=mean(rating))%>%
filter(count == 67)
boxplot_data <- examples %>%
filter(movie %in% movies_rated_67_times$movie) %>%
select(movie, rating)
boxplot(rating~movie,
data=boxplot_data,
xlab="Title",
ylab="Rating"
)
csv file: https://drive.google.com/file/d/1ODM7qdOVI2Sua7HMHGEfNdYz_R1jhGAD/view?usp=sharing
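Transforming your title column from factor to character seems to fix it, most likely because the filtered factor still carries the levels of every movie that was filtered out, and boxplot() reserves a slot for each level. Additionally, I would insert line breaks into some of the movie names and reduce the text size so that everything fits into the plot: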
boxplot_data <- examples %>%
filter(movie %in% movies_rated_67_times$movie) %>%
mutate(title = as.character(title)) %>%
select(title, rating)
boxplot_data[boxplot_data$title == "Adventures of Robin Hood, The (1938)",]$title <- "Adventures of Robin Hood,\nThe (1938)"
boxplot_data[boxplot_data$title == "Wallace & Gromit: The Best of Aardman Animation (1996)",]$title <- "Wallace & Gromit: The Best of\nAardman Animation (1996)"
boxplot_data[boxplot_data$title == "Bridges of Madison County, The (1995)",]$title <- "Bridges of Madison County,\nThe (1995)"
par(cex.axis = 0.7)
boxplot(rating~title,
data=boxplot_data,
xlab="Title",
ylab="Rating")
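If unused factor levels are indeed the cause, droplevels() would be another way to get the same effect while keeping the column a factor. The sketch below uses a made-up data frame (the film names and ratings are hypothetical) and also sets las = 2 to draw the axis labels vertically, which is what the question was originally after.

```r
# Hypothetical ratings: four titles stored as a factor
ratings <- data.frame(
  title  = factor(rep(c("Film A", "Film B", "Film C", "Film D"), each = 5)),
  rating = rep(1:5, times = 4)
)

# Filtering rows does not remove the levels of the titles that were dropped
kept <- ratings[ratings$title %in% c("Film A", "Film B"), ]
levels_before <- nlevels(kept$title)   # 4: the unused levels are still there

kept$title <- droplevels(kept$title)
levels_after <- nlevels(kept$title)    # 2: only the titles that remain

# A larger bottom margin leaves room for vertical labels (las = 2)
old_par <- par(mar = c(6, 4, 2, 1), cex.axis = 0.7)
boxplot(rating ~ title, data = kept, xlab = "", ylab = "Rating", las = 2)
par(old_par)
```

With long titles the bottom margin in mar would need to be increased further.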