How not to get the data squeezed with boxplot? - r

For an assignment I have to draw a plot with boxplot(), but the boxes come out squeezed together. I'm new to R :(
I guess the problem is that the x-axis labels are too long and are not placed vertically, so I tried to rotate them (based on "Inserting labels in box plot in R on a 45 degree angle?"), without success.
examples <- read.csv("mov.development.csv", sep="\t")
library(dplyr)
movies_rated_67_times <- examples %>%
group_by(movie) %>%
summarize(count=n(), avg_rating=mean(rating))%>%
filter(count == 67)
boxplot_data <- examples %>%
filter(movie %in% movies_rated_67_times$movie) %>%
select(title, rating)
boxplot(rating~title,
data=boxplot_data,
xlab="Title",
ylab="Rating",
xaxt = "n"
)
text(seq_along(boxplot_data$title), par("usr")[3] - 0.5, labels = names(boxplot_data$title), srt = 90, adj = 1, xpd = TRUE);
I want to have a plot like this
But I got this
With a different set of labels that are not too long (the movie column instead of title), the normal code works.
Normal code:
examples <- read.csv("mov.development.csv", sep="\t")
library(dplyr)
movies_rated_67_times <- examples %>%
group_by(movie) %>%
summarize(count=n(), avg_rating=mean(rating))%>%
filter(count == 67)
boxplot_data <- examples %>%
filter(movie %in% movies_rated_67_times$movie) %>%
select(movie, rating)
boxplot(rating~movie,
data=boxplot_data,
xlab="Title",
ylab="Rating"
)
csv file: https://drive.google.com/file/d/1ODM7qdOVI2Sua7HMHGEfNdYz_R1jhGAD/view?usp=sharing

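Converting your title column from factor to character seems to fix it. A likely reason is that the filtered factor still carries every original title as a level, and boxplot() reserves a slot on the x axis for each level, including the empty ones. Additionally, I would insert line breaks into some of the movie names and reduce the text size so that they fit into the plot.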
boxplot_data <- examples %>%
filter(movie %in% movies_rated_67_times$movie) %>%
mutate(title = as.character(title)) %>%
select(title, rating)
boxplot_data[boxplot_data$title == "Adventures of Robin Hood, The (1938)",]$title <- "Adventures of Robin Hood,\nThe (1938)"
boxplot_data[boxplot_data$title == "Wallace & Gromit: The Best of Aardman Animation (1996)",]$title <- " Wallace & Gromit: The Best of\nAardman Animation (1996)"
boxplot_data[boxplot_data$title == "Bridges of Madison County, The (1995)",]$title <- "Bridges of Madison County,\nThe (1995)"
par(cex.axis = 0.7)
boxplot(rating~title,
data=boxplot_data,
xlab="Title",
ylab="Rating")
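The same two ideas can be sketched without hard-coding each title. This is only an illustration on toy data: the titles below stand in for the title column, wrap_label() is a hypothetical helper name (not from the original answer), and droplevels() is used here as an alternative to as.character(), assuming the squeezing comes from unused factor levels.

```r
# Toy titles standing in for the `title` column (one is filtered out below).
titles <- factor(c("Adventures of Robin Hood, The (1938)",
                   "Fargo (1996)",
                   "Some Other Movie (1990)"))
kept <- titles[titles != "Some Other Movie (1990)"]

nlevels(kept)             # 3: the filtered-out title is still a level
kept <- droplevels(kept)  # discard levels that no longer occur
nlevels(kept)             # 2

# Hypothetical helper: break long labels at whitespace near `width` characters.
wrap_label <- function(x, width = 25) {
  vapply(x,
         function(s) paste(strwrap(s, width = width), collapse = "\n"),
         character(1),
         USE.NAMES = FALSE)
}

wrapped <- wrap_label(as.character(kept))
cat(wrapped, sep = "\n---\n")
```

With the real data, something like boxplot_data$title <- wrap_label(as.character(boxplot_data$title)) before the boxplot() call should have the same effect as the manual replacements above; the width of 25 is an arbitrary choice to tune against the plot size.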

Related

fct_reorder/ggplot2 not ordering as desired

I am working on the most recent TidyTuesday data and have an issue with my plot. New Jersey is shown above Nashville even though Nashville has more tweets overall. I am unsure how to fix this.
I think it has something to do with the one tweet by user etmckinley being sorted in Nashville first since it alphabetically comes before username sqlsekou. Perhaps there is a way to reverse the sorting and have it work correctly?
If not, how else can I order the data correctly to have Nashville above New Jersey?
library(tidyverse)
tweets <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-15/tweets.csv')
top_states <- tweets %>%
filter(
location != "iPhone: 34.704040,-86.722909",
location != "Kevin.Elder#GCSU.edu"
) %>%
drop_na(location) %>%
count(location, sort = TRUE) %>%
slice_max(n, n = 7) %>%
pull(location)
tweets %>%
filter(
location != "iPhone: 34.704040,-86.722909",
location != "Kevin.Elder#GCSU.edu"
) %>%
drop_na(location) %>%
count(location, username, sort = TRUE) %>%
filter(location %in% top_states) %>%
mutate(location = fct_reorder(location, n)) %>%
mutate(username = fct_reorder(username, -n)) %>%
ggplot(aes(n, location, fill = username)) +
geom_col() +
scale_fill_brewer(palette = "Set3") +
labs(
x = "Quantity of tweets",
y = "Location",
title = "Tweets by location over 3 month period",
subtitle = "Filled by username"
)
By default, fct_reorder reorders by the median value. Your Nashville bar has 2 components, one big and one small, and the median is halfway in between. Your NJ bar has only one component, so the median is the full value. Override the default in fct_reorder by setting .fun = sum. See ?fct_reorder for more details.
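A minimal sketch of the difference, using made-up counts rather than the TidyTuesday data (the numbers 10, 1 and 8 are assumptions chosen only to mimic one big and one small Nashville component):

```r
library(forcats)

# Made-up per-user counts: Nashville has two users, New Jersey has one.
df <- data.frame(
  location = c("Nashville", "Nashville", "New Jersey"),
  n        = c(10, 1, 8)
)

# Default summary is the median: Nashville 5.5, New Jersey 8.
by_median <- levels(fct_reorder(df$location, df$n))

# Summing instead: Nashville 11, New Jersey 8.
by_sum <- levels(fct_reorder(df$location, df$n, .fun = sum))

by_median
by_sum
```

Levels are sorted in ascending order of the summary, and ggplot2 draws the last level at the top of a discrete y axis, so with .fun = sum Nashville ends up above New Jersey. In the question's pipeline this would amount to mutate(location = fct_reorder(location, n, .fun = sum)).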

Text transform issue inserting image in a gt table

I had previously asked about inserting images in gt tables here and gotten a lot of help. But I am now encountering a new issue. The difficult part is that I am having trouble creating a minimal example.
Consider this code from the answer to my original post
library(gt)
library(magrittr)
library(purrr)
library(ggplot2)
# Let's make some pngs
mtcars %>%
split(.$cyl) %>%
map(~ ggplot(.x, aes(hp, mpg, color = factor(gear))) + geom_point()) %>%
set_names(c("CA", "UT", "OH")) %>%
iwalk(~ ggsave(paste0(.y, "_test.png"), .x))
column_one <- c("CA", "UT", "OH")
# Put the filenames in the column
column_two <- c("CA", "UT", "OH")
dashboard.data <- data.frame(column_one, column_two, stringsAsFactors = FALSE)
names(dashboard.data)[1] <- "State"
names(dashboard.data)[2] <- "IncidenceGauge"
dboard3 <- dashboard.data %>%
gt() %>%
tab_header(
title = md("**Big Title**"),
subtitle = md("*Subtitle*")
) %>%
text_transform(
locations = cells_body(vars(IncidenceGauge)),
fn = function(x) {
# loop over the elements of the column
map_chr(x, ~ local_image(
filename = paste0(.x, "_test.png"),
height = 100
))
}) %>%
cols_label(IncidenceGauge = "Risk Level")
print(dboard3)
It runs fine and produces this image
Now, imagine that I want to create a header row, and in that row there is no image in column two. In my case I create an image called NA_test.png that is just a blank white square. For the example below, you'll have to imagine that's what NA_test.png is and not a plot from mtcars. You'll see that column two now begins with NA...
library(gt)
library(magrittr)
library(purrr)
library(ggplot2)
# Let's make some pngs
mtcars %>%
split(.$cyl) %>%
map(~ ggplot(.x, aes(hp, mpg, color = factor(gear))) + geom_point()) %>%
set_names(c("NA", "UT", "OH")) %>%
iwalk(~ ggsave(paste0(.y, "_test.png"), .x))
column_one <- c("Header", "UT", "OH")
# Put the filenames in the column
column_two <- c(NA, "UT", "OH")
dashboard.data <- data.frame(column_one, column_two, stringsAsFactors = FALSE)
names(dashboard.data)[1] <- "State"
names(dashboard.data)[2] <- "IncidenceGauge"
dboard3 <- dashboard.data %>%
gt() %>%
tab_header(
title = md("**Big Title**"),
subtitle = md("*Subtitle*")
) %>%
text_transform(
locations = cells_body(vars(IncidenceGauge)),
fn = function(x) {
# loop over the elements of the column
map_chr(x, ~ local_image(
filename = paste0(.x, "_test.png"),
height = 100
))
}) %>%
cols_label(IncidenceGauge = "Risk Level")
print(dboard3)
This runs as well and produces this...again you'll just have to imagine that next to Header NA_test.png is just a blank white image.
My issue is that in my script (much longer, but copied in form from this example), when it encounters NAs it seems to treat them as blanks rather than substituting NA. I get error messages that R can't find _test.png: not NA_test.png, as I would expect it to look for, but _test.png.
Here's the dataframe that is passed to gt()
There is no call to text_transform() for the Base column, but there is for the CommunityNewCases column and that's where the error occurs.
Can anyone suggest any reason as to why this is happening?
How to handle NA fields is important to dashboard design. So the real question is - should a dashboard really aggregate a plot of the entries which have "NA" as location? NA is typically a data entry error, so isn't the plot misleading?
As a statistical language, R considers the logical NA as a blank or empty data field. This is by design. And it is powerful, because NA retains its meaning in both character and numeric fields.
To change R's standard behavior, simply reassign an alternative value to the logical NA. For example, your sample code can replace logical NA with the character "NA" to use it in a paste command.
column_two <- c("NA", "UT", "OH")
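Or, if your data is more complex than this toy example, replace NA with "NA" using the tidyr package. For a data frame, replace_na() takes a named list of replacements, one entry per column.
dboard3 <- dashboard.data %>%
tidyr::replace_na(list(IncidenceGauge = "NA")) %>%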
gt() %>%
...
Alternatively, you could retain the logical NA and handle it with ifelse() and an is.na() test, as shown here:
dboard3 <- dashboard.data %>%
gt() %>%
tab_header(
title = md("**Big Title**"),
subtitle = md("*Subtitle*")
) %>%
text_transform(
locations = cells_body(vars(IncidenceGauge)),
fn = function(x) {
# loop over the elements of the column
map_chr(x, ~ local_image(
filename = ifelse(is.na(.x),
"NA_test.png",
paste0(.x, "_test.png")),
height = 100
))
}) %>%
cols_label(IncidenceGauge = "Risk Level")
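The file-name logic can also be pulled out into a small function and checked on its own, without gt. This is a sketch only: image_file() is a hypothetical helper name, and treating empty strings like NA is an assumption based on the reported "_test.png" error, which suggests the value may reach the function as "" rather than NA.

```r
# Hypothetical helper: map cell values to image file names.
# Both NA and "" fall back to the placeholder image.
image_file <- function(x, placeholder = "NA_test.png") {
  is_missing <- is.na(x) | x == ""
  ifelse(is_missing, placeholder, paste0(x, "_test.png"))
}

files <- image_file(c(NA, "", "UT"))
files
```

Inside text_transform(), the filename argument would then become filename = image_file(.x), with the rest of the call unchanged.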

Highcharts/HighcharteR: Stacked Column Bars Get Skinny After 50+ Series in R

I am having a problem with skinny bars in a stacked column chart in highcharter in R. I have created a repo of the code below and am wondering if someone could help me avoid the issue of having the bars get super skinny when I add more than 49 series to the graph.
Any ideas or work arounds would be greatly appreciated.
As always, thank you in advance.
Best,
Nate
library(highcharter)
library(magrittr)
library(viridisLite)
dfmtx<- as.data.frame.matrix(matrix(data = abs(rnorm(n=20*50, mean = 0, sd=1)), ncol = 50))
dfmtx<- dfmtx/rowSums(dfmtx)
df<- data.frame(date=seq.Date(from = as.Date("2001-01-01"), to = Sys.Date(), by="years")[1:20],
dfmtx, stringsAsFactors = F)
hc<- highcharter::highchart() %>%
highcharter::hc_chart(type = "column") %>%
#highcharter::hc_plotOptions(column=list(pointWidth=45, pointPadding=0, groupPadding=0.1, padding=0)) %>%
highcharter::hc_plotOptions(column = list(stacking = "normal"), series=list(cropThreshold=200)) %>%
highcharter::hc_xAxis(categories = df$date, title=list(text="Fake Date")) %>%
highcharter::hc_title(text=paste0("Bars Get Skinny When You Add 50"))
for(i in 2:50){ # Smiles...This Works!
#for(i in 2:51){ # Tears..skinny bars :(
the_name<- colnames(df)[i]
hc<- hc %>%
highcharter::hc_add_series(name=the_name,
data = df[,i],
stack = "SameStack")
}
# Pretty colors...why not?
cols<- viridisLite::viridis(n=length(hc$x$hc_opts$series))
cols<- base::substr(cols, 0,7)
hc<- hc %>%
highcharter::hc_yAxis(title=list(text="Proportion"), max=1) %>%
highcharter::hc_colors(cols) %>%
highcharter::hc_legend(align="center")
hc
You can set the width and height of the chart using hc_size().
for(i in 2:51){ # Tears..skinny bars :(
the_name<- colnames(df)[i]
hc<- hc %>%
highcharter::hc_add_series(name=the_name,
data = df[,i],
stack = "SameStack")
}
# Pretty colors...why not?
cols<- viridisLite::viridis(n=length(hc$x$hc_opts$series))
cols<- base::substr(cols, 0,7)
hc<- hc %>%
highcharter::hc_yAxis(title=list(text="Proportion"), max=1) %>%
highcharter::hc_colors(cols) %>%
highcharter::hc_legend(align="center") %>%
highcharter::hc_size(height = 800) #Setting chart height to 800.
hc

Formatting an ftable in R

I have the following 3 way table I created in R.
with(dataset, ftable(xtabs(count ~ dos + sex + edu)))
The output looks like
                       edu    high     low  medium unknown
dos                 sex
five-to-ten-years   female  247776   44916  127133   23793
                    male    225403   37858  147821   20383
five-years-or-less  female  304851   58018  182152   33649
                    male    253977   55720  193621   28972
more-than-ten-years female  709303  452605  539403  165675
                    male    629162  309193  689299  121336
native-born         female 1988476 1456792 2094297  502153
                    male   1411509 1197395 2790522  395953
unknown             female   57974   75480   73204  593141
                    male     40176   57786   93108  605542
I want to rename the variables and format the table so that I can include it in a report. I know that I can use dnn to rename the variables, but are there any other recommendations to rename the variables? And to format the table (similar to using kable)?
You could convert the output to a text matrix using the following function, after which you can style with kable however you choose:
ftab_to_matrix <- function(ft)
{
row_vars <- attr(ft, "row.vars")
for(i in seq_along(row_vars)){
row_vars[[i]] <- c(names(row_vars[i]), row_vars[[i]])}
rowvar_widths <- sapply(row_vars, function(x) max(nchar(x))) + 1
col_vars <- attr(ft, "col.vars")
rowvar_widths <- c(1, cumsum(c(rowvar_widths, max(nchar(names(col_vars))))))
ft_text <- capture.output(print(ft))
row_cols <- sapply(seq_along(rowvar_widths)[-1], function(x)
substr(ft_text, rowvar_widths[x - 1], rowvar_widths[x]))
ft_text <- substr(ft_text, rowvar_widths[length(rowvar_widths)] + 2, 100)
ft_breaks <- c(1, cumsum(lapply(strsplit(ft_text[length(ft_text)], "\\d "),
function(x) nchar(x) + 2)[[1]]))
col_cols <- sapply(seq_along(ft_breaks)[-1], function(x)
substr(ft_text, ft_breaks[x - 1], ft_breaks[x]))
trimws(cbind(row_cols, col_cols))
}
So, for example, using my example data from your last question, you could do something like:
library(knitr)      # kable()
library(kableExtra) # kable_styling(), as_image()
my_tab <- with(`3waydata`, ftable(xtabs(count ~ duration + sex + education)))
as_image(kable_styling(kable(ftab_to_matrix(my_tab))), file = "kable.png")
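On the renaming part of the question, one option besides dnn is to rename the dimensions of the xtabs() result before it is passed to ftable(). The sketch below uses made-up counts in place of the question's dataset, and the labels "Duration", "Sex" and "Education" are only illustrative guesses at what dos, sex and edu stand for.

```r
# Toy data standing in for the question's `dataset` (counts are made up).
dataset <- data.frame(
  dos   = rep(c("native-born", "unknown"), each = 4),
  sex   = rep(c("female", "male"), times = 4),
  edu   = rep(c("high", "high", "low", "low"), times = 2),
  count = 1:8
)

tab <- xtabs(count ~ dos + sex + edu, data = dataset)

# Rename the table dimensions; ftable() picks these names up.
names(dimnames(tab)) <- c("Duration", "Sex", "Education")

ft <- ftable(tab)
ft
```

The renamed labels are carried into the row.vars and col.vars attributes, which is where ftab_to_matrix() above reads them from.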
Might have been easier had you given the full picture when you asked your first question... You could use gt to make fancy tables for reports. This is an edited version more fully demonstrating some capabilities.
library(dplyr)
library(gt)
way3data <- data %>%
group_by(duration, education, sex) %>%
summarise(count = sum(number)) %>%
ungroup
# Reorder with select and Titlecase with stringr
longer <- tidyr::pivot_wider(way3data,
values_from = count,
names_from = "education") %>%
select(duration, sex, high, medium, low, unknown) %>%
rename_with(stringr::str_to_title)
# Demonstrating some of the features of gt
# obviously could have done some of this
# to the original dataframe
myresults <- longer %>%
group_by(Duration) %>%
gt(rowname_col = "Sex") %>%
row_group_order(
groups = c("native-born",
"more-than-ten-years",
"five-to-ten-years",
"five-years-or-less",
"unknown")
) %>%
tab_spanner(label = "Education",
columns = matches("High|Low|Medium|Unknown")) %>%
tab_stubhead(label = "Duration or something") %>%
tab_style(
style = cell_text(style = "oblique", weight = "bold"),
locations = cells_row_groups()) %>%
tab_style(
style = cell_text(align = "right", style = "italic", weight = "bold"),
locations = cells_column_labels(
columns = vars(High, Low, Medium, Unknown)
)) %>%
tab_style(
style = cell_text(align = "right", weight = "bold"),
locations = cells_stub()) %>%
tab_header(
title = "Fancy table of counts with Duration, Education and Gender") %>%
tab_source_note(md("More information is available at https://stackoverflow.com/questions/62284264."))
# myresults
# Can save in other formats including .rtf
myresults %>%
gtsave(
"tab_1.png", expand = 10
)
You can read about all the formatting choices here
Data compliments of Allan
set.seed(69)
data <- data.frame(education = sample(c("high","low","medium","unknown"), 600, T),
sex = rep(c("Male", "Female"), 300),
duration = sample(c("unknown", "native-born",
"five-years-or-less", "five-to-ten-years",
"more-than-ten-years"), 600, T),
number = rpois(600, 10))

Problem in plotting a map graph using highcharter

I am having trouble writing R code for a choropleth using the highcharter package. I am trying to replicate lines 84-112 of the code at the following link: https://www.kaggle.com/gloriousc/global-terrorism-in-1970-2016/code.
I have been encountering 2 errors:
When running line 95, the error says that there is no object called "countrycode_data". From what I found online, countrycode_data is a dataset that maps country names to country codes. As far as I understand, it should be included in the "countrycode" package, which I have installed, but I could not find out how to access it. To work around this, I downloaded the dataset from the internet and carried on with the code.
When running the choropleth code starting on line 103, I get the following error: "Error: %in%(x = tail(joinBy, 1), table = names(df)) is not TRUE". I have no idea what this error means, so I'm asking for help here.
I worked around the 1st error, although I am not sure that I did it the correct way.
I am going to leave the entire code right here:
knitr::opts_chunk$set(echo=TRUE, error=FALSE)
library(dplyr) #manipulate table
library(ggplot2) #visualization
library(highcharter) #making map
library("viridisLite") #Default Color Maps
library(countrycode) #list of country code
library(treemap) #make a treemap chart
library(reshape2) #melt function
library(plotly) #pie chart
library(tm) #text mining
library(SnowballC) #stemming text
library(wordcloud) #make a text chart
library(RColorBrewer) #make a color pallette
library(DT) #make datatable
#input the data
terror <- read.csv("../input/globalterrorismdb_0617dist.csv")
# Terrorist Incidents Map
#count terrorism incidents per country as a dataframe
countries <- terror %>%
group_by(country_txt) %>%
summarise(Total = round(n()))
#Making a terrorism map
#Credit to umeshnarayanappa
names(countries) <- c("country.name", "total") #change the column name
countries$iso3 <- countrycode_data[match(countries$country.name, countrycode_data$country.name.en), "iso3c"] #add iso3 column from country_code
data(worldgeojson, package = "highcharter")
dshmstops <- data.frame(q = c(0, exp(1:5)/exp(5)),
c = substring(viridis(5 + 1, option = "D"), 0, 7)) %>% #from viridisLite, make a color
list_parse2() #from highchart package, parse df to list
highchart() %>% #from highchart package
hc_add_series_map(worldgeojson, countries, value = "total", joinBy = "iso3") %>%
hc_colorAxis(stops = dshmstops) %>%
hc_legend(enabled = TRUE) %>%
hc_add_theme(hc_theme_db()) %>%
hc_mapNavigation(enabled = TRUE) %>%
hc_title(text = "Global Terrorism in 1970-2016", style = list(fontSize = "25px")) %>%
hc_add_theme(hc_theme_google()) %>%
hc_credits(enabled = TRUE, text = "Sources: National Consortium for the Study of Terrorism and Responses to Terrorism (START)", style = list(fontSize = "10px"))
Note that the lines do not work for me even though I copied and pasted them exactly.
Thank you for reading everything and also, I hope, for your help.
I tried to replicate the example, and I hope the following is enough for you to work through the rest yourself.
It seems that countrycode_data is in the psData package. That package requires the rJava package, which is not on my machine right now. Since you were looking for a workaround, I found my own way: I scraped country data that includes iso3. (You can probably use the ISOcodes package too.)
You need to check whether the country names in the two datasets are identical, which is a common challenge; you will usually see some mismatches. I do not have time to correct them all, but I show how to revise some country names with recode(). The bottom line is that you want to add iso3 to countries, so you need to make the country names match as closely as possible. (Obviously, some countries do not exist any more, and you cannot really do anything about them.) The author used match() in his code, but I used left_join() to do the same.
After this, I think you are ready to follow the rest of the code. Note that hc_add_series_map() also performs a join: worldgeojson has a property called iso3, so countries must have a column called iso3. Otherwise, you will get the same error message again.
library(tidyverse)
library(data.table)
library(rvest)
library(highcharter)
library(viridisLite)
# I used fread(). This is much faster.
terror <- fread("globalterrorismdb_0919dist.csv")
# I wrote my own code which does the same job.
count(terror, country_txt) %>%
setNames(nm = c("country.name", "total")) -> countries
# Get iso3 data
map_dfc(.x = c("official", "shortname", "iso3"),
.f = function(x) {read_html("http://www.fao.org/countryprofiles/iso3list/en/") %>%
html_nodes(paste("td.", x, sep = "")) %>%
html_text() %>%
gsub(pattern = "\\n(\\s+)?", replacement = "")}) %>%
setNames(nm = c("official", "shortname", "iso3")) -> iso3
# Revise some country names.
mutate(iso3, shortname = trimws(sub(x = shortname, pattern = "\\(.*\\)",
replacement = "")),
shortname = recode(.x = shortname,
`Bosnia and Herzegovina` = "Bosnia-Herzegovina",
`Brunei Darussalam` = "Brunei",
Czechia = "Czech Republic",
Congo = "Republic of the Congo",
`Côte d'Ivoire` = "Ivory Coast",
`Russian Federation` = "Russia",
`United Kingdom of Great Britain and Northern Ireland` = "United Kingdom",
`United States of America`= "United States"
)) -> iso3
# Join the two data sets
left_join(countries, iso3, by = c("country.name" = "shortname")) -> countries
data(worldgeojson, package = "highcharter")
dshmstops <- data.frame(q = c(0, exp(1:5)/exp(5)),
c = substring(viridis(5 + 1, option = "D"), 0, 7)) %>% #from viridisLite, make a color
list_parse2()
highchart() %>% #from highchart package
hc_add_series_map(worldgeojson, df = countries,
value = "total", joinBy = "iso3") %>%
hc_colorAxis(stops = dshmstops) %>%
hc_legend(enabled = TRUE) %>%
hc_add_theme(hc_theme_db()) %>%
hc_mapNavigation(enabled = TRUE) %>%
hc_title(text = "Global Terrorism in 1970-2016", style = list(fontSize = "25px")) %>%
hc_add_theme(hc_theme_google()) %>%
hc_credits(enabled = TRUE,
text = "Sources: National Consortium for the Study of Terrorism and Responses to Terrorism (START)",
style = list(fontSize = "10px"))
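As a possible alternative to scraping, the iso3 column might also be built with the countrycode() function from the countrycode package the question already loads. This is a sketch under assumptions: it presumes an installed countrycode version that provides countrycode() with the "country.name" origin and "iso3c" destination codes, and the three countries and their totals are made up for illustration.

```r
library(countrycode)

# Made-up stand-in for the `countries` table built from the terrorism data.
countries <- data.frame(
  country.name = c("Germany", "United States", "Japan"),
  total        = c(10, 20, 30)
)

# Convert English country names to ISO 3166-1 alpha-3 codes.
countries$iso3 <- countrycode(countries$country.name,
                              origin = "country.name",
                              destination = "iso3c")

# hc_add_series_map(..., joinBy = "iso3") needs this column to exist.
stopifnot("iso3" %in% names(countries))
countries
```

Names that cannot be matched come back as NA with a warning, so historical countries in the real data would still need manual handling, as with recode() above.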
