Wrong visual plot representation of a correlation

I built this code to check the correlation between Bitcoin and other altcoins. The table data is fine; my only issue is that when I try to plot the result to get a visual representation, I get the result below:
library(tidyverse)
library(tidyquant)
library(timetk)
library(tibbletime)
library(corrr)
symbols <- c("BTC-USD","ETH-USD","TRX-USD","EOS-USD","ADA-USD")
prices <- getSymbols(symbols,
src = 'yahoo',
from = "2019-09-01",
to = "2020-03-24",
auto.assign = TRUE,
warnings = FALSE) %>%
map(~Ad(get(.))) %>%
reduce(merge) %>%
`colnames<-`(symbols)
prices_monthly <- to.monthly(prices, indexAt = "last", OHLC = FALSE)
prices_monthly %>% correlate() %>% focus('BTC-USD') %>%
ggplot(aes(x = rowname, y = 'BTC-USD')) +
geom_bar(stat = "identity") +
ylab("Correlation with BTC-USD") +
xlab("Variable")
Plot result
What did I do wrong, please?

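You need backticks around the variable name. In quotes, 'BTC-USD' is a constant string that is mapped to every bar; in backticks it refers to the column.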
prices_monthly %>% correlate() %>% focus('BTC-USD') %>%
ggplot(aes(x = rowname, y=`BTC-USD`)) + # <- Here
geom_bar(stat="identity") +
ylab("Correlation with BTC-USD") +
xlab("Variable")
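The difference between the two spellings can be checked on a small stand-in table. This is only a sketch: the `cors` data frame and its values are made up for illustration and are not the real output of correlate() and focus().

```r
library(ggplot2)

# Hypothetical stand-in for the result of correlate() %>% focus()
cors <- data.frame(rowname = c("ETH-USD", "TRX-USD", "EOS-USD"))
cors[["BTC-USD"]] <- c(0.95, 0.80, 0.88)  # made-up correlations

# Quotes produce a length-one character constant, not the column
quoted <- with(cors, "BTC-USD")

# Backticks refer to the column, whose name is not a syntactic R name
backticked <- with(cors, `BTC-USD`)

# Plot built from the backticked column, so bar heights follow the data
p <- ggplot(cors, aes(x = rowname, y = `BTC-USD`)) +
  geom_col() +
  ylab("Correlation with BTC-USD") +
  xlab("Variable")
```

Printing `p` draws bars of different heights, while the quoted version places every bar at the single discrete value "BTC-USD".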

Related

How to convert to stat_summary_2d?

I am using the 'flights' data set from the 'nycflights13' package. I need to convert the original code below into code that uses 'stat_summary_2d':
flights %>%
mutate(cancel = 1*(dep_time %>% is.na)) %>%
group_by(carrier, origin) %>%
summarise(cancel = 100*mean(cancel, na.rm = T)) %>%
ggplot() +
geom_tile(aes(origin, carrier, fill = cancel)) +
geom_text(aes(origin, carrier, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
In the data set the missing value of the 'dep_time' variable means a cancelled flight, and the variable 'cancel' is created by calculating the proportion of cancelled flights over the scheduled flights.
Below is how I apply 'stat_summary_2d' to convert the original code:
flights %>%
mutate(cancel = 1*(dep_time %>% is.na)) %>%
ggplot() +
stat_summary_2d(aes(carrier, origin, z = cancel)) +
geom_text(data = flights %>% group_by(carrier, origin) %>%
summarize(cancel = 100*mean(cancel, na.rm = T)) %>% ungroup,
aes(factor(carrier), origin, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
When I execute my code, the error is
Error in `summarize()`:
! Problem while computing `cancel = 100 * mean(cancel, na.rm = T)`.
i The error occurred in group 1: carrier = "9E", origin = "EWR".
Could anyone tell me how to fix this problem? Thank you so much!

The original data is not updated with the new column unless we use %<>% (from magrittr) instead of %>%, so the flights object used inside geom_text() has no cancel column. It may be easier to create two objects:
library(dplyr)
library(ggplot2)
library(nycflights13)
flight1 <- flights %>%
mutate(cancel = 1*(dep_time %>% is.na))
flight2 <- flight1 %>%
group_by(carrier, origin) %>%
summarize(cancel = 100*mean(cancel, na.rm = TRUE), .groups = 'drop')
ggplot(flight1) +
stat_summary_2d(aes(carrier, origin, z = cancel)) +
geom_text(data = flight2, aes(factor(carrier),
origin, label = round(cancel,1)),
col = "blue", size = 5) +
scale_fill_distiller("Cancel Ratio", palette = "RdYlGn") +
theme_bw()
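The point about the original data not being updated can be seen on a tiny table. The sketch below uses a made-up `toy` data frame rather than flights, so the names and values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical three-row stand-in for flights
toy <- data.frame(dep_time = c(517, NA, 533))

# The pipe returns a new data frame; toy itself is left untouched
toy %>% mutate(cancel = 1 * is.na(dep_time))
has_cancel_before <- "cancel" %in% names(toy)

# Assigning the result keeps the new column for later steps
toy_cancel <- toy %>% mutate(cancel = 1 * is.na(dep_time))
has_cancel_after <- "cancel" %in% names(toy_cancel)
```

Here `has_cancel_before` is FALSE and `has_cancel_after` is TRUE, which mirrors why summarizing `cancel` on the unmodified flights data fails.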

Creating a geographic file for use with tmap and coming up with error when coding shapefile

I am trying to reproduce a map I found here: http://zevross.com/blog/2018/10/02/creating-beautiful-demographic-maps-in-r-with-the-tidycensus-and-tmap-packages/
I am using RStudio and am running the following code:
library(ggplot2) # For plotting
library(tidycensus) # For downloading Census data
library(tmap) # For creating tmap
library(tmaptools) # For reading and processing spatial data related to tmap
library(dplyr) # For data wrangling
library(sf) # For reading, writing and working with spatial objects
census_api_key("enter your API key here", overwrite = TRUE)
dat12 <- get_acs("county", table = "B27001", year = 2012,
output = "tidy", state = NULL, geometry = FALSE) %>%
rename(`2012` = estimate) %>%
select(-NAME, -moe)
dat16 <- get_acs("county", table = "B27001", year = 2016,
output = "tidy", state = NULL, geometry = TRUE, shift_geo = TRUE) %>%
rename(`2016` = estimate) %>%
select(-moe)
dat <- left_join(dat16, dat12, by = c("GEOID", "variable"))
st_geometry(dat) <- NULL # This drops the geometry and leaves a table
head(dat)
dat <- mutate(dat,
cat = case_when(
variable %in% paste0("B27001_0",
c("09","12","37","40")) ~ "pop1834",
variable %in% paste0("B27001_0",
c("11","14","39","42")) ~ "pop1834ni")) %>%
filter(!is.na(cat))
# Create long version
dat <- tidyr::gather(dat, year, estimate, c(`2012`, `2016`))
# Group the data by our new categories and sum
dat <- group_by(dat, GEOID, NAME, year, cat) %>%
summarize(estimate = sum(estimate)) %>%
ungroup() %>%
tidyr::spread(cat, estimate)
dat <- mutate(dat, est = (pop1834ni/pop1834) * 100) %>%
select(-c(pop1834, pop1834ni)) %>%
tidyr::spread(year, est) %>%
mutate(diff = `2016`-`2012`)
head(dat)
datlong <- select(dat, -diff) %>%
tidyr::gather(year, estimate, c(`2012`, `2016`)) %>%
group_by(year) %>%
mutate(med = round(median(estimate, na.rm = TRUE), 1))
ggplot(datlong, aes(estimate)) +
geom_histogram(fill = "firebrick2",
color = "white", bins = 60) +
xlab("Uninsured adults ages 18-34 by county (%)") +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(~year, ncol = 1) +
geom_vline(aes(xintercept = med,
group = year), lty = "dashed") +
geom_text(aes(label = paste("Median = ", med), x = med, y = 55))
d10 <- top_n(dat, 10, diff) %>%
mutate(type = "Insured population decreased",
difftemp = diff)
i10 <- top_n(dat, -10, diff) %>%
mutate(type = "Insured population increased",
difftemp = abs(diff))
id10 <- bind_rows(list(i10, d10)) %>%
arrange(desc(difftemp))
ggplot(id10) +
geom_col(aes(x = forcats::fct_reorder(NAME, difftemp),
y = difftemp, fill = type)) +
coord_flip() +
scale_fill_manual(values = c("firebrick2", "cyan4")) +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.title = element_blank()) +
ggtitle("Counties with the greatest change (+/-) in
insured population, ages 18-34, 2012-2016") +
ylab("Difference in % insured (2016 - 2012)") +
xlab("")
shp <- dat16 %>%
filter(variable == "B27001_001") # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff)
Everything runs perfectly up to the last bit of code, the one that begins with shp. Once
shp <- dat16 %>%
filter(variable == "B27001_001") # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff)
is run, I get the following error:
Error in select(GEOID, NAME) : object 'GEOID' not found
I have checked dat16 and dat; GEOID and NAME are present in both. I am not sure what is wrong with the select() function, as I have not loaded another library that might interfere with it. Any help would be appreciated.

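I see now what was missing: a %>% (pipe) after the filter() call. Without it, the assignment ended at filter() and select(GEOID, NAME) ran as a separate statement, which treated GEOID as its data argument. The corrected code: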
shp <- dat16 %>%
filter(variable == "B27001_001") %>% # much faster than using distinct()
select(GEOID, NAME) %>%
left_join(dat, by = c("GEOID", "NAME")) %>%
arrange(GEOID) %>%
rename(
uninsured_2012 = `2012`,
uninsured_2016 = `2016`,
uninsured_diff = diff
)
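The failure mode can be reproduced without any Census data. In this sketch the `toy` data frame is a made-up stand-in for dat16, and it assumes no object called GEOID exists in the workspace.

```r
library(dplyr)

# Hypothetical stand-in for dat16
toy <- data.frame(
  GEOID = c("01001", "01003"),
  NAME = c("County A", "County B"),
  variable = c("B27001_001", "B27001_002")
)

# Called on its own, select() takes GEOID as the data argument and fails
msg <- tryCatch(
  select(GEOID, NAME),
  error = function(e) conditionMessage(e)
)

# With the pipe in place, GEOID and NAME are looked up as columns
ok <- toy %>%
  filter(variable == "B27001_001") %>%
  select(GEOID, NAME)
```

`msg` holds the same "object 'GEOID' not found" complaint reported in the question, while `ok` is a one-row table with the two requested columns.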

ggridges with time series - R

I have a data frame and I want to draw a density graph with geom_density_ridges from ggridges, but it returns the same line for all states. What am I doing wrong?
I would like to add trim = TRUE like in here, but it returns the following error message:
Ignoring unknown parameters: trim
My code:
library(tidyverse)
library(ggridges)
url <- httr::GET("https://xx9p7hp1p7.execute-api.us-east-1.amazonaws.com/prod/PortalGeral",
httr::add_headers("X-Parse-Application-Id" =
"unAFkcaNDeXajurGB7LChj8SgQYS2ptm")) %>%
httr::content() %>%
'[['("results") %>%
'[['(1) %>%
'[['("arquivo") %>%
'[['("url")
data <- openxlsx::read.xlsx(url) %>%
filter(is.na(municipio), is.na(codmun)) %>%
mutate_at(vars(contains(c("Acumulado", "Novos", "novos"))), ~ as.numeric(.))
data[,8] <- openxlsx::convertToDate(data[,8])
data <- data %>%
mutate(mortalidade = obitosAcumulado / casosAcumulado,
date = data) %>%
select(-data)
ggplot(data = data, aes(x = date, y = estado, heights = casosNovos)) +
geom_density_ridges(trim = TRUE)

You are probably not looking for density ridges but regular ridgelines.
There are a few choices to make in terms of normalisation. If you want the ridges to resemble densities, you can divide each group by its sum: height = casosNovos / sum(casosNovos). Next, you can decide whether each ridge should be scaled to fit between the lines, which you can do with the scales::rescale() function. It's your decision whether to do this per group or for the entire data. I chose the entire data below.
library(tidyverse)
library(ggridges)
url <- httr::GET("https://xx9p7hp1p7.execute-api.us-east-1.amazonaws.com/prod/PortalGeral",
httr::add_headers("X-Parse-Application-Id" =
"unAFkcaNDeXajurGB7LChj8SgQYS2ptm")) %>%
httr::content() %>%
'[['("results") %>%
'[['(1) %>%
'[['("arquivo") %>%
'[['("url")
data <- openxlsx::read.xlsx(url) %>%
filter(is.na(municipio), is.na(codmun)) %>%
mutate_at(vars(contains(c("Acumulado", "Novos", "novos"))), ~ as.numeric(.))
data[,8] <- openxlsx::convertToDate(data[,8])
data <- data %>%
mutate(mortalidade = obitosAcumulado / casosAcumulado,
date = data) %>%
select(-data) %>%
group_by(estado) %>%
mutate(height = casosNovos / sum(casosNovos))
ggplot(data = data[!is.na(data$estado),],
aes(x = date, y = estado, height = scales::rescale(height))) +
geom_ridgeline()
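The two normalisation steps can be tried on a small table first. The sketch below uses a made-up `toy` data frame with two states and five days; the values are assumptions chosen only to make the arithmetic easy to follow.

```r
library(dplyr)
library(ggplot2)
library(ggridges)

# Hypothetical daily new cases for two states
toy <- data.frame(
  date = rep(as.Date("2020-04-01") + 0:4, times = 2),
  estado = rep(c("SP", "RJ"), each = 5),
  casosNovos = c(10, 20, 40, 20, 10, 1, 2, 3, 2, 2)
)

# Step 1: divide by the group total so each state's heights sum to 1
toy <- toy %>%
  group_by(estado) %>%
  mutate(height = casosNovos / sum(casosNovos)) %>%
  ungroup()

# Step 2: rescale over the entire data so heights span 0 to 1
toy$scaled <- scales::rescale(toy$height)

p <- ggplot(toy, aes(x = date, y = estado, height = scaled)) +
  geom_ridgeline()
```

Rescaling per group instead would mean moving the scales::rescale() call inside the grouped mutate(), so every state's peak reaches the same height.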

ggplot error using rmarkdown - object ´percent´ not found

I have been making this scatterplot using ggplot in R.
Running the code below as a regular R script in RStudio produces the plot I want without any errors.
The problem appears when I use the same code in a chunk in R Markdown and knit to PDF.
I get an error saying: Error in check_breaks_labels(breaks, labels): object percent not found.
Any suggestions? I hope the reproducible example is OK.
library(tidyquant)
library(timetk)
library(ggplot2)
SPY <- tq_get("SPY", from = '2010-01-01',
to = "2020-04-04",
get = "stock.prices")
FXI <- tq_get("FXI", from = '2010-01-01',
to = "2020-04-04",
get = "stock.prices")
QQQ <- tq_get("QQQ", from = '2010-01-01',
to = "2020-04-04",
get = "stock.prices")
SPY_monthly_returns <- SPY %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
col_rename = "SPY_ret")
FXI_monthly_returns <- FXI %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
col_rename = "FXI_ret")
QQQ_monthly_returns <- QQQ %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
col_rename = "QQQ_ret")
SPY_monthly_mean_ret <- SPY_monthly_returns %>%
select(SPY_ret) %>%
.[[1]] %>%
mean(na.rm = TRUE)
FXI_monthly_mean_ret <- FXI_monthly_returns %>%
select(FXI_ret) %>%
.[[1]] %>%
mean(na.rm = TRUE)
QQQ_monthly_mean_ret <- QQQ_monthly_returns %>%
select(QQQ_ret) %>%
.[[1]] %>%
mean(na.rm = TRUE)
SPY_monthly_sd_ret <- SPY_monthly_returns %>%
select(SPY_ret) %>%
.[[1]] %>%
sd()
FXI_monthly_sd_ret <- FXI_monthly_returns %>%
select(FXI_ret) %>%
.[[1]] %>%
sd()
QQQ_monthly_sd_ret <- QQQ_monthly_returns %>%
select(QQQ_ret) %>%
.[[1]] %>%
sd()
d <- data.frame(meanret = c(SPY_monthly_mean_ret,FXI_monthly_mean_ret,QQQ_monthly_mean_ret), sd = c(SPY_monthly_sd_ret,FXI_monthly_sd_ret,QQQ_monthly_sd_ret), names = c("SPY","FXI","QQQ"))
ggplot(d, aes(sd,meanret, color= ticker)) +
geom_point(size=1) + geom_text(aes(label=names)) +
ggtitle("Monthly Risk-Return Plot") + xlab("Volatility") +
ylab("Mean Return") + theme_bw() +
scale_y_continuous(label = percent, limits = c(0, 0.02)) +
scale_x_continuous(label = percent, limits = c(0, 0.08))

The solution was to require(scales) and after that specify "ticker" as I had not done that in my code chunk.
Both were suggested really quickly which is very helpful to me who is learning while writing my bachelor thesis. Thanks a lot!
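The require(scales) fix works because percent is a function exported by the scales package, and the labels argument of a continuous scale accepts a function. A minimal sketch, with a made-up `d` data frame standing in for the risk-return table:

```r
library(ggplot2)

# Hypothetical stand-in for the risk-return table
d <- data.frame(
  sd = c(0.04, 0.06, 0.05),
  meanret = c(0.010, 0.004, 0.014),
  names = c("SPY", "FXI", "QQQ")
)

# percent is a function in scales; the prefix avoids relying on attach order
example_label <- scales::percent(0.02, accuracy = 1)

p <- ggplot(d, aes(sd, meanret, color = names)) +
  geom_point(size = 1) +
  geom_text(aes(label = names)) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 0.02)) +
  scale_x_continuous(labels = scales::percent, limits = c(0, 0.08))
```

Writing scales::percent keeps the chunk independent of whichever packages happen to be attached when the document is knit.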

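percent is a function from the scales package, so the chunk has to be able to find it. Either load scales or qualify the call. Note also that d has no ticker column, so color is mapped to names here.
ggplot(d, aes(sd,meanret, color= names)) +
geom_point(size=1) + geom_text(aes(label=names)) +
ggtitle("Monthly Risk-Return Plot") + xlab("Volatility") +
ylab("Mean Return") + theme_bw() +
scale_y_continuous(labels = scales::percent, limits = c(0, 0.02)) +
scale_x_continuous(labels = scales::percent, limits = c(0, 0.08))
Without the prefix or library(scales), ggplot looks for an object named percent and fails if no attached package provides one. The script most likely worked interactively because scales was already attached in that session. Because labels accepts any function, you could also do
p <- ggplot(...)
mylabel <- function(x) paste0(100 * x, "%")
p + scale_x_continuous(labels = mylabel, limits = c(0, 0.08))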

How can we wrangle data to obtain the ratio/proportion chart shown

The goal is to produce a visualization indicating ratios.
How can we produce such a ratio chart (highlighted) in R?
library(tidyverse)
# Dataset creation
df <- data.frame(cls = c(rep("A",4),rep("B",4)),
grd = c("A1",rep("A2",3),rep(c("B1","B2"), 2)),
typ = c(rep("m",2),rep("o",2),"m","n",rep("p",2)),
pnts = c(rep(1:4,2)))
df
#### Data wrangling
df1 <- df %>%
group_by(cls) %>%
summarise(cls_pct = sum(pnts))
df1
df2 <- df %>%
group_by(cls,grd) %>%
summarize(grd_pct = sum(pnts))
df2
df3 <- df %>%
group_by(cls,grd,typ) %>%
summarise(typ_pct = sum(pnts))
df3
#### Attempt to combine all df1,df2,df3
# but mutate and summarise are mixing up leading to wrong results
df3 %>%
group_by(cls,grd) %>%
mutate(grd_pct = sum(typ_pct)) %>%
group_by(cls) %>%
mutate(cls_pct = sum(grd_pct))
Attempt to visualize all the ratios in 1 chart
data %>%
pivot_longer(cols = -c(cls:pnts),
names_to = "per_cat",
values_to = "percent") %>%
ggplot(aes(cls,percent, col = typ, fill = grd)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_bw()
plot of the same.

EDIT -- added function version with more useful output for visualization.
ORIG: At this point it may be worth making a function to reduce copying and pasting, but this may get you what you need:
library(tidyverse)
df %>%
group_by(cls) %>%
mutate(per1 = sum(pnts),
per1_pct = per1 / sum(per1)) %>%
group_by(cls, grd) %>%
mutate(per2 = sum(pnts),
per2_pct = per2 / sum(per2)) %>%
group_by(cls, grd, typ) %>%
mutate(per3 = sum(pnts),
per3_pct = per3 / sum(per3)) %>%
ungroup()
EDIT: Here's a general function to calculate the stats for a given grouping, making it easier to combine a few groupings together in long format better suited for visualization.
df_sum <- function(df, level, ...) {
df %>%
group_by(...) %>%
summarize(grp_ttl = sum(pnts)) %>%
mutate(ttl = sum(grp_ttl),
pct = grp_ttl / ttl) %>%
ungroup() %>%
mutate(level = {{ level }} )
}
df_sum(df, level = 1, cls) %>%
bind_rows(df_sum(df, level = 2, cls, grd)) %>%
bind_rows(df_sum(df, level = 3, cls, grd, typ)) %>%
mutate(label = coalesce(as.character(typ), # This grabs the first non-NA
as.character(grd),
as.character(cls))) -> df_summed
df_summed %>%
ggplot(aes(level, grp_ttl)) +
geom_col(color = "white") +
geom_text(aes(label = paste0(label, "\n", grp_ttl, "/", ttl)),
color = "white",
position = position_stack(vjust = 0.5)) +
scale_x_reverse() + # To make level 1 at the top
coord_flip() # To switch from vertical to horizontal orientation
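The df_sum() helper depends on summarize() dropping only the last grouping variable, so the following mutate() computes totals within the parent group. The sketch below checks that behaviour for the cls/grd level using the question's data (without the typ column); spelling out .groups = "drop_last" is an addition here to make the default explicit.

```r
library(dplyr)

# Same cls, grd and pnts values as in the question
df <- data.frame(
  cls = c(rep("A", 4), rep("B", 4)),
  grd = c("A1", rep("A2", 3), rep(c("B1", "B2"), 2)),
  pnts = rep(1:4, 2)
)

lvl2 <- df %>%
  group_by(cls, grd) %>%
  # After this step the result is still grouped by cls
  summarize(grp_ttl = sum(pnts), .groups = "drop_last") %>%
  # ttl is therefore the total within each cls, not the grand total
  mutate(ttl = sum(grp_ttl),
         pct = grp_ttl / ttl) %>%
  ungroup()
```

Each grd share is expressed relative to its own cls, so the pct values within A and within B each add up to 1.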
