How to read Unicode code points into an R data frame

I have a two-column text file of Unicode code points of interest (Greek symbols in this test, but any set of Unicode characters, generically):
$ cat ut.txt
\u0391 Α
\u0392 Β
\u0393 Γ
\u0394 Δ
\u0395 Ε
\u0396 Ζ
...
I'd like to read this into R, so that I can kick the tires on the typeface I am using to make plots that contain mathematical or other Unicode symbols.
As a minimal reproducible example, I start by drawing a random sample from this Unicode table:
set.seed(42)
df <- data.frame(date = 1:10 , value = cumsum(runif(10 , max = 10)) )
ut <- read.table("ut.txt", allowEscapes=TRUE)
df$labels <- paste("\\", sample(ut$V1, size=10), sep="")
The head of the data frame looks like this:
date value labels
1 1 9.14806 \\u03A8
2 2 18.51881 \\u03BB
3 3 21.38021 \\u03C4
4 4 29.68469 \\u039C
5 5 36.10214 \\u03A6
6 6 41.29310 \\u03C2
When I plot from the labels column, R writes out the literal string, and not the Unicode character it represents:
library(ggplot2)
p <- ggplot(df, aes(x=date, y=value, label=labels))
p <- p + geom_line()
p <- p + ggtitle("5\u03BCg (\u03C7-squared test)") # control title
p <- p + geom_text()
library(Cairo)
ggsave("test.pdf", device=cairo_pdf)
Here is what the test plot looks like:
What I would like to see are Greek symbols at each point along the line, instead of their literal string equivalents.
How can I read a set of Unicode code points from a text file and use them directly?
Important note: I did test sampling from the second column of ut.txt, which works. However, I am specifically interested in learning what is required to correctly read in the encoded code point equivalent from a file.

Here's one approach using scale_shape_manual. I entered your data by hand so that I didn't have to read the text file:
set.seed(42)
df <- data.frame(date = 1:10 , value = cumsum(runif(10 , max = 10)) )
df <- df[1:6, ]
## Following line stands in for what you read from `read.table`. In your solution, just use what you got from `read.table`
df$labels <- c("\u03A8", "\u03BB", "\u03C4", "\u039C", "\u03A6", "\u03C2")
library(ggplot2)
p <- ggplot(df, aes(x=date, y=value, shape = labels))
p <- p + geom_line()
p <- p + ggtitle("5\u03BCg (\u03C7-squared test)") # control title
p <- p + geom_point(size = 5) + scale_shape_manual(values = df$labels)
p
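On the part of the question about reading the code points from the file: the first column arrives as plain text such as \u03A8, and as far as I know R only turns \u escapes into characters when it parses string literals in source code, not when it reads data. One possible workaround, sketched below, is to convert the hexadecimal digits yourself with strtoi() and intToUtf8(). The helper name codepoint_to_char and the sample vector are made up for illustration and are not part of the answer above.

```r
# Hypothetical helper (not from the answer above): convert code point
# strings such as "\\u03A8", "u03A8" or "U+03A8" to the characters they name.
codepoint_to_char <- function(x) {
  hex <- sub("^\\\\?[uU]\\+?", "", x)   # drop the leading "\u", "u" or "U+"
  codes <- strtoi(hex, base = 16L)      # hexadecimal digits -> integer code points
  vapply(codes, intToUtf8, character(1))
}

# Stand-in for the first column of ut.txt as it comes out of read.table()
escaped <- c("\\u03A8", "\\u03BB", "u03C4")
labels <- codepoint_to_char(escaped)
labels
```

The resulting character vector can then be assigned to df$labels and used with scale_shape_manual() or geom_text() in the same way as the hand-typed labels.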

It also works with geom_text() (infinite thanks to @astrofunkswag for the smart way of showing us how to include symbols properly):
library(ggplot2)
#Code
ggplot(df, aes(x=date, y=value))+
geom_line()+
ggtitle("5\u03BCg (\u03C7-squared test)")+
geom_text(label=c("\u03A8", "\u03BB", "\u03C4", "\u039C", "\u03A6", "\u03C2"))
Output:

Related

How to enter data into ggplots

I'm trying to enter the below data into a data frame, to make a ggplot line graph.
#functions for the hh budget and utility functions
pqxf <- function(y)(1*y) # replace p with price of y
pqyf <- function(x)(-1.25*x)+20 # -1.25 is the wage rate
utilityf <- function(x)80*(1/(x)) # 80 is the utility provided
hours <- c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)
#functions are turned into data frames
pqy <- data.frame("consumption" =
pqxf(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
pqx <- data.frame("leisure" =
pqxf(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
utility <- data.frame("utility" =
utilityf(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
#each data frame is combined into a single data frame, that will be used for tables and charts
hh <- data.frame(pqx, pqy, utility, hours)
print(hh)
#this shows the utility, and the cost of x and y, one data frame
library(ggplot2)
ggplot(hh, aes(x=pqx, y=hours))+
xlim(0,20)+ylim(0,20)+ # limits set for the assignment
labs(x = "leisure(hours)",y="counsumption(units)")+
geom_line(aes(x = pqx, y = pqy))+
geom_line(aes(x = pqx, y = utility))+
geom_point(aes(x=8,y=10))+ #values of x and y of tangent point
geom_hline(yintercept = 10,linetype="dotted")+ # y of tangent point
geom_vline(xintercept = 8,linetype = "dotted")+ #x of tangent point
geom_text(label="E", x=8,y=10,hjust=-1,size=2)+
geom_text(label="-1.25(units/hour)= -w = MRS", x=9,y=2,hjust=.02,size=2)+
geom_text(label="U=80", x=4,y=19,hjust=1,size=2)
When I run this, I get the following message:
Error in is.finite(x) : default method not implemented for type 'list'
Should I store the data in a different format than a data frame, format my data frame differently, or set up ggplot differently so that it can handle lists?
Try replacing pqx with leisure and pqy with consumption. Inside aes(), pqx and pqy refer to whole data frames (which are lists), whereas leisure and consumption are the actual columns of hh.
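A minimal sketch of that fix follows. It assumes that consumption was meant to come from pqyf (the question's code calls pqxf for both columns), and it builds hh directly from hours; the ggplot2 part only runs if the package is installed.

```r
hours <- c(0:16, 20)

pqx <- data.frame(leisure = hours)
hh  <- data.frame(pqx,
                  consumption = -1.25 * hours + 20,  # budget line, as in pqyf (assumed intent)
                  utility     = 80 / hours)          # indifference curve, as in utilityf

is.list(pqx)            # a data frame is a list, which is.finite() cannot handle
is.numeric(hh$leisure)  # the column itself is a plain numeric vector

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Map aesthetics to column names of hh, not to the separate data frames
  p <- ggplot(hh, aes(x = leisure)) +
    geom_line(aes(y = consumption)) +
    geom_line(aes(y = utility)) +
    xlim(0, 20) + ylim(0, 20)
}
```

The remaining layers from the question (geom_point, geom_hline, geom_vline, geom_text) can be added to p unchanged, since they use fixed coordinates.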

How to use ggplot with prop.table(table(x))?

First, I have data with two categorical variables, like this:
nombre <- c("A","B","C","A","D","F","F","H","I","J")
sexo <- c(rep("man",4),rep("woman",6))
edad <- c (25,14,25,76,12,90,65,45,56,43)
pais <- c(rep("spain",3),rep("italy",4),rep("portugal",3))
data <- data.frame(nombre=nombre,sexo=sexo,edad=edad,pais=pais)
If I use:
prop.table(table(data$sexo,data$pais), margin=1)
I can see the relative frequencies of the levels, for example for Italy (Man=0.25, Woman=0.5),
but the problem is that when I try to plot prop.table(table(x)) I get something different:
ggplot(as.data.frame(prop.table(table(data),margin=1)), aes(x=pais ,y =Freq, fill=sexo))+geom_bar(stat="identity")
The Y axis runs from 0 to 3 and, for example, the Italy bar shows Woman=2 and Man=2.5.
I don't need that (and I don't know what it is showing); I want the same values as in the prop.table(table(x)) output above.
I think the problem is related to margin=1.
Thank you!
You need to build the same table:
tab = prop.table(table(data$sexo,data$pais), margin=1)
tab = as.data.frame(tab)
Then plot:
ggplot(tab,aes(x=Var2,y=Freq,fill=Var1)) + geom_col()
Or simply:
barplot(prop.table(table(data$sexo,data$pais), margin=1))
You're probably looking for something like position = "dodge".
If I run the following on your data:
P <- prop.table(table(data$sexo,data$pais), margin=1)
ggplot(as.data.frame(P), aes(x = Var2, y = Freq, fill = Var1)) +
geom_bar(stat="identity", position = "dodge")
I get the following graph:
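As for why the original plot ran from 0 to 3: table(data) cross-tabulates all four columns, so margin=1 normalises within each nombre instead of within each sexo, and the stacked bars then add those per-name proportions up. The sketch below checks this on the question's data and also names the table dimensions (my own addition) so that the data frame gets the columns sexo and pais instead of Var1 and Var2.

```r
nombre <- c("A","B","C","A","D","F","F","H","I","J")
sexo   <- c(rep("man", 4), rep("woman", 6))
edad   <- c(25, 14, 25, 76, 12, 90, 65, 45, 56, 43)
pais   <- c(rep("spain", 3), rep("italy", 4), rep("portugal", 3))
data   <- data.frame(nombre = nombre, sexo = sexo, edad = edad, pais = pais)

# table(data) tabulates all four columns; margin = 1 then scales the
# cells so that each nombre sums to 1, not each sexo.
tab4 <- prop.table(table(data), margin = 1)
length(dim(tab4))   # number of dimensions of the full table
sum(tab4)           # one unit per distinct nombre

# Tabulate only the two variables of interest, naming the dimensions so
# that as.data.frame() yields the columns sexo, pais and Freq.
tab2 <- prop.table(table(sexo = data$sexo, pais = data$pais), margin = 1)
df2  <- as.data.frame(tab2)
df2
```

With df2, the plot call becomes ggplot(df2, aes(x = pais, y = Freq, fill = sexo)) + geom_col(position = "dodge").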

Plotting a data.table with multiple columns (each with rows 1:7) via ggplot with a single geom_point(), using aesthetics to color them differently

I intend to compare timings between two algorithm-based functions, f1 and f2, via microbenchmark. They work on an rpois-simulated dataset whose sizes are given by the length-7 vector 10^seq(1,4,by=0.5), i.e.:
[1] 10.00000 31.62278 100.00000 316.22777 1000.00000 3162.27766 10000.00000
I am also working on plotting them, with all of the information reported by microbenchmark (i.e. min, lq, mean, median, uq and max; yes, all of them are required, everything except expr and neval). I need this via ggplot on a log-log scale with a single geom_point(), with each statistic in a different colour. Here is my code for that:
library(ggplot2)
library(microbenchmark)
require(dplyr)
library(data.table)
datasetsizes<-c(10^seq(1,4,by=0.5))
f1_min<-integer(length(datasetsizes))
f1_lq<-integer(length(datasetsizes))
f1_mean<-integer(length(datasetsizes))
f1_median<-integer(length(datasetsizes))
f1_uq<-integer(length(datasetsizes))
f1_max<-integer(length(datasetsizes))
f2_min<-integer(length(datasetsizes))
f2_lq<-integer(length(datasetsizes))
f2_mean<-integer(length(datasetsizes))
f2_median<-integer(length(datasetsizes))
f2_uq<-integer(length(datasetsizes))
f2_max<-integer(length(datasetsizes))
for(loopvar in 1:(length(datasetsizes)))
{
s<-summary(microbenchmark(f1(rpois(datasetsizes[loopvar],10), max.segments=3L),f2(rpois(datasetsizes[loopvar],10), maxSegments=3)))
f1_min[loopvar] <- s$min[1]
f2_min[loopvar] <- s$min[2]
f1_lq[loopvar] <- s$lq[1]
f2_lq[loopvar] <- s$lq[2]
f1_mean[loopvar] <- s$mean[1]
f2_mean[loopvar] <- s$mean[2]
f1_median[loopvar] <- s$median[1]
f2_median[loopvar] <- s$median[2]
f1_uq[loopvar] <- s$uq[1]
f2_uq[loopvar] <- s$uq[2]
f1_max[loopvar] <- s$max[1]
f2_max[loopvar] <- s$max[2]
}
algorithm<-data.table(f1_min ,f2_min,
f1_lq, f2_lq,
f1_mean, f2_mean,
f1_median, f2_median,
f1_uq, f2_uq,
f1_max, f2_max, datasetsizes)
ggplot(algorithm, aes(x=algorithm,y=datasetsizes)) + geom_point(aes(color=algorithm)) + labs(x="N", y="Runtime") + scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
I debugged my code at each step, and up to the assignment of the computed values to a data.table named 'algorithm' it works fine.
Here are the computed runs, which are passed as length-7 vectors into the data table along with datasetsizes (also of length 7) at the end:
> algorithm
f1_min f2_min f1_lq f2_lq f1_mean f2_mean f1_median f2_median f1_uq f2_uq f1_max f2_max datasetsizes
1: 86.745000 21.863000 105.080000 23.978000 113.645630 24.898840 113.543500 24.683000 120.243000 25.565500 185.477000 39.141000 10.00000
2: 387.879000 52.893000 451.880000 58.359000 495.963480 66.070390 484.672000 62.061000 518.876500 66.116500 734.149000 110.370000 31.62278
3: 1608.287000 341.335000 1845.951500 382.062000 1963.411800 412.584590 1943.802500 412.739500 2065.103500 443.593500 2611.131000 545.853000 100.00000
4: 5.964166 3.014524 6.863869 3.508541 7.502123 3.847917 7.343956 3.851285 7.849432 4.163704 9.890556 5.096024 316.22777
5: 23.128505 29.687534 25.348581 33.654475 26.860166 37.576444 26.455269 37.080149 28.034113 41.343289 35.305429 51.347386 1000.00000
6: 79.785949 301.548202 88.112824 335.135149 94.248141 370.902821 91.577462 373.456685 98.486816 406.472393 135.355570 463.908240 3162.27766
7: 274.367776 2980.122627 311.613125 3437.044111 337.287131 3829.503738 333.544669 3820.517762 354.347487 4205.737045 546.996092 4746.143252 10000.00000
microbenchmark computed the values as expected, but ggplot throws this error:
Don't know how to automatically pick scale for object of type data.table/data.frame. Defaulting to continuous.
Error: Aesthetics must be either length 1 or the same as the data (7): colour, x
I am not able to resolve this. Can anyone let me know what is wrong and how to correct the plotting procedure?
Also, on a side note, I had to extract all the values (min, lq, mean, median, uq, max) separately from the computed benchmark, since I can't take the summary itself as a data.table because it contains the expr (expression) and neval columns. I was able to eliminate one of the columns using
algorithm[,!"expr"] or algorithm[,!"neval"]
but I can't eliminate two of them together, i.e.
algorithm[,!"expr",!"neval"] or algorithm[,!("expr","neval")] or algorithm[,!"expr","neval"]
- all possible combinations like that don't work (throws 'invalid argument type' error).
Any possible workaround or solution to this and the plotting (main thing) would be highly appreciated!
Your problem lies mainly with the fact that you're referring to an algorithm column in the ggplot formula that does not exist in your object.
From what you gave, I could do the following :
algorithm$algorithm <- 1:nrow(algorithm)
ggplot(algorithm, aes(x=algorithm,y=datasetsizes)) + geom_point(aes(color=algorithm)) + labs(x="N", y="Runtime") +
scale_x_continuous(trans = 'log10') + scale_y_continuous(trans = 'log10')
and plot this fine :
EDIT : let's clean this up a bit...
As per OP's request, I've cleaned up his code a bit.
There are a lot of things you can work on to improve on your code's readability, but I'm focusing more on the practical aspect here.
Basically, join your variables together in a table if you know they'll end up as such.
There are a bunch of tricks you can use to assign the values to the correct spots, a few of which you'll see in the code below.
library(ggplot2)
library(microbenchmark)
require(dplyr)
library(data.table)
datasetsizes<-c(10^seq(1,4,by=0.5))
l <- length(datasetsizes)
# make a vector with your different conditions
conds <- c('f1', 'f2')
# initializing a table from the get-go is much cleaner
# than doing everything in separate variables
dat <- data.frame(
datasetsizes = rep(datasetsizes, each = length(conds)), # make replicates for each condition
cond = rep(NA, l*length(conds))
)
dat[, c("min", "lq", "mean", "median", "uq", "max")] <- 0
dat$cond <- factor(dat$cond, levels = conds)
head(dat)
for(i in 1:l){ # for the love of god, don't use something as long as 'loopvar' as a loop variable
# I don't have f1 & f2 so I did what I could...
s <- summary(microbenchmark(
"f1" = rpois(datasetsizes[i],10),
"f2" = {length(rpois(datasetsizes[i],10))}))
dat[which(dat$datasetsizes == datasetsizes[i]), # select rows of current ds size
c("cond", "min", "lq", "mean", "median", "uq", "max")] <- s[, !colnames(s)%in%c("neval")]
}
dat <- data.table(dat)
ggplot(dat, aes(x=datasetsizes,y=mean)) +
geom_point(aes(color = cond)) +
geom_line(aes(color = cond)) + # added to see a clear difference btw conds
labs(x="N", y="Runtime") + scale_x_continuous(trans = 'log10') +
scale_y_continuous(trans = 'log10')
This gives the following plot.
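Two loose ends from the question are not covered by the code above: dropping both expr and neval at once, and showing every statistic (not only the mean) in its own colour with a single geom_point(). One possible sketch reshapes the table to long format with base R. The data frame wide and its numbers are made up for illustration and stand in for the real benchmark summary; the names wide, no_meta, long, stat and runtime are my own.

```r
# Toy stand-in for the benchmark results; the numbers are made up.
wide <- data.frame(
  datasetsizes = rep(c(10, 100, 1000), each = 2),
  expr   = rep(c("f1", "f2"), times = 3),
  min    = c(1, 2, 10, 30, 100, 400),
  median = c(2, 3, 15, 40, 150, 500),
  max    = c(4, 5, 30, 60, 300, 700),
  neval  = 100
)

# Drop several columns at once by putting their names in ONE vector.
no_meta <- wide[, !(names(wide) %in% c("expr", "neval"))]
# The data.table equivalent for a data.table DT would be:
#   DT[, !c("expr", "neval")]

# Long format: one row per (size, function, statistic).
stats <- setdiff(names(wide), c("datasetsizes", "expr", "neval"))
long <- data.frame(
  datasetsizes = rep(wide$datasetsizes, times = length(stats)),
  expr         = rep(wide$expr, times = length(stats)),
  stat         = rep(stats, each = nrow(wide)),
  runtime      = unlist(wide[stats], use.names = FALSE)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # A single geom_point(): colour separates the statistics, shape the functions
  p <- ggplot(long, aes(x = datasetsizes, y = runtime,
                        colour = stat, shape = expr)) +
    geom_point() +
    scale_x_log10() + scale_y_log10() +
    labs(x = "N", y = "Runtime")
}
```

Because each statistic is now a value of the stat column instead of a column of its own, one aes() mapping is enough to colour all of them.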

R - Random number generation in ggplot and Shiny

I am building a plot of Net Present Value (NPV), using FinCal package, and its odds. For the NPV, the cash-flows are simulated using a triangular distribution for sales, normal distribution for costs and so on. So, here is a snippet of what I am doing:
npvCdf <- function(n) {
N <- sort(n)
P <- ecdf(N)
return(P)
}
makePlot <- function(C, m) {
N <- m$NPV / C$MILLION
P <- npvCdf(N)
#
# NPV distribution curve
n <- sort(N)
p <- P(n) * 100
df <- data.frame(npv = n, odds = p)
#
# Points of interest
o <- C$NPV_BREAK_EVEN_WORST_ODDS
q <- round((quantile(n, o)), C$DIGITS)
e <- C$NPV_BREAK_EVEN_VALUE
b <- P(e) * 100 # THIS IS THE ERROR I CANT FIGURE OUT
w <- o * 100
s <- getBreakEven(C, m)
#
# Labels
npvOdds <- paste("Odds of break-even : ", b, "%")
salesThresh <- paste("Sales threshold : ", s)
worstCase <-
paste("Worst case (# 5% odds) : ", q, "million")
#
# Make plot
#
g <- ggplot(df, aes(x = npv, y = odds)) +
geom_line(colour = "blue") +
labs(title = "NPV and Odds") +
labs(x = "NPV (million)") +
labs(y = "Percent (%)") +
geom_vline(xintercept = e,
colour = "red",
linetype = "longdash") +
geom_hline(yintercept = b,
colour = "green",
linetype = "longdash") +
geom_vline(xintercept = q,
colour = "green",
linetype = "dotdash") +
geom_hline(yintercept = w,
colour = "red",
linetype = "dotdash")
The C is a data frame of all the constants that are used for calculations of cash-flows, NPV calculations, etc. For example, C$MILLION=1000000 used to divide NPV for simpler representation. The m is a data-frame of sales, cash-flows and NPV per simulation. The simulations are used for cash-flows (triangular distribution), variable cost (normal distribution) and so on.
And, here is the Shiny code that uses the above snippet.
library(shiny)
source("../npd-c.R")
# Define server logic
shinyServer(function(input, output) {
output$npdPlot <- renderPlot({
C <- data.frame(2017,5000,1000000,3,100,500000,0.0,0.05,0.1,
input$salesRange[1],
input$salesRange[2],
input$salesMode,
input$demDeclMean,
input$demDeclSd,
input$varCostMean,
input$varCostSd,
input$fixedCostRange[1],
input$fixedCostRange[2]
)
names(C) <-
c(
"SEED",
"ITERATIONS",
"MILLION",
"DIGITS",
"PRICE",
"OUTLAY",
"NPV_BREAK_EVEN_VALUE",
"NPV_BREAK_EVEN_WORST_ODDS",
"HURDLE_RATE",
"SALES_TRIANG_MIN",
"SALES_TRIANG_MAX",
"SALES_TRIANG_MODE",
"DEM_DECL_FACTOR_MEAN",
"DEM_DECL_FACTOR_SD",
"VAR_COST_RATE_MEAN",
"VAR_COST_RATE_SD",
"FIX_COST_RATE_MIN",
"FIX_COST_RATE_MAX"
)
n <- npd(C,-1)
g <- makePlot(C,n)
g
})
})
The problem is as follows.
When I run this code in plain R, the plot comes out right: the NPV curve and the horizontal and vertical lines are all where they should be. When it runs as a Shiny application, however, the horizontal and vertical lines are hugely displaced. This happens despite hiving off the NPV and cash-flow code into a separate .R file and setting the same seed value for both the Shiny and non-Shiny versions. For example, P(0)=40.07 without Shiny and P(0)=4.7 with the Shiny application.
What am I missing?
First of all, let me say this is pretty useful code. It is a nice representation of a Monte Carlo simulation using NPV, and I like the plots. It is a post I am pretty sure I will refer back to.
I think I see where the problem is, though: it is basically a matter of misinterpretation plus one small programming error.
The stated problem is that these plots do not show the same results although they should. At first glance, the blue ECDF-NPV curves do look the same:
Shiny version:
Stand alone version:
However, if you look carefully, you will see that they are in fact not the same: the NPV at the 50 percent mark in the first case is about 1.5 million, whereas it is only about 0.2 in the second case.
The curves look the same, but they are not. The other point is that there is an error in one of the calculations, which confuses things further. P is an empirical CDF, so P(e) is the share of simulated NPVs at or below e. The "Odds of break-even" are therefore incorrectly calculated and are actually the "Odds of losing money".
The correct calculation should be:
b <- (1-P(e)) * 100
And the correct odds of breaking even in the first case would be around 60%, and in the second case around 95%, which matches up with the expected NPV as well.
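A small self-contained check of that reading of ecdf(), using made-up normally distributed NPVs in place of the real simulation (the variable names and the distribution are assumptions for illustration only):

```r
set.seed(1)
npv <- rnorm(5000, mean = 0.5, sd = 1)  # made-up NPVs in millions, not the real model

P <- ecdf(npv)   # empirical CDF: P(v) is the share of values <= v
e <- 0           # break-even value

odds_loss       <- P(e) * 100        # what the question's code computed as b
odds_break_even <- (1 - P(e)) * 100  # the corrected calculation

# P(e) agrees with counting the simulations at or below e directly
all.equal(P(e), mean(npv <= e))
c(odds_loss = odds_loss, odds_break_even = odds_break_even)
```

The two percentages always add up to 100, so a plot that reports one of them under the other's label will look badly off whenever the distribution is not centred on e.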

automatically convert polynomial to expression in ggplot2 title

Here is some code I am using to auto-generate some regression fits:
require(ggplot2)
# Prep data
nPts = 200
prepared=runif(nPts,0,10)
rich=5-((prepared-5)^2)/5 + 5*runif(length(prepared))
df <- data.frame(rich=rich, prepared=prepared)
deg = 1 # User variable
lm <- lm(df$rich ~ poly(df$prepared, deg, raw=T))
# Create expression
coefs <- lm$coefficients
eq <- paste0(round(coefs,2),'*x^', 0:length(coefs), collapse='+') # (1)
pl <- ggplot(df, aes(x=prepared, y=rich)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x,deg), size = 1) +
ggtitle(eq) # (2)
print(pl)
This code should run (with ggplot2 installed). The problem is in the lines marked (1) and (2):
(1) generates a string representation of the polynomial
(2) sets the string as the plot title
As it stands, my title is "6.54*x^0+0.09*x^1+6.54*x^2". However, I want a more attractive rendering, so that (2) looks more like what would be seen with:
ggtitle(expression(6.54*x^0+0.09*x^1+6.54*x^2)) # (2')
i.e., powers raised, multiplication signs dropped, etc. Any help much appreciated!
Here's a function that I built to solve my problem:
poly_expression <- function(coefs){
# build the string
eq <- paste0(round(coefs,2),'*x^', (1:length(coefs)-1), collapse='+')
# some cleaning
eq <- gsub('\\+\\-','-', eq) # +-n -> -n
eq <- gsub('\\*x\\^0','', eq) # n*x^0 -> n
eq <- gsub('x\\^1','x', eq) # n*x^1 -> n*x
eq <- parse(text=eq) # return expressions
return(eq)
}
Then ggtitle(poly_expression(coefs)) renders as required.
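As a quick sanity check, the function can be exercised without any plotting. The sketch below repeats the function so that it runs on its own; the coefficient vector is made up for illustration.

```r
poly_expression <- function(coefs){
  # build the string, with powers 0, 1, ..., length(coefs) - 1
  eq <- paste0(round(coefs, 2), '*x^', (1:length(coefs) - 1), collapse = '+')
  # some cleaning
  eq <- gsub('\\+\\-', '-', eq)    # +-n   -> -n
  eq <- gsub('\\*x\\^0', '', eq)   # n*x^0 -> n
  eq <- gsub('x\\^1', 'x', eq)     # n*x^1 -> n*x
  parse(text = eq)                 # return an expression
}

coefs <- c(6.54, 0.09, -0.2)   # made-up coefficients for illustration
ex <- poly_expression(coefs)

is.expression(ex)            # the result is an expression, as ggtitle() accepts
eval(ex[[1]], list(x = 2))   # evaluates the polynomial at x = 2
```

One caveat: the pattern 'x\\^1' also matches the start of x^10 and higher, so for polynomials of degree 10 or more the last gsub() would need a stricter pattern.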
