I have more than a hundred variables. If I use the base plot function to draw the box plot, readability is very poor, so I would like to plot it with ggplot2 instead.
If you look at the source code for the plot.Boruta() boxplot method, a fair amount of data manipulation takes place before anything is plotted. Here is one example where I carried out the same data manipulations but altered the code to output a data frame, which can then be passed to ggplot2 for plotting:
#install.packages("Boruta")
library(Boruta)
library(mlbench)
data(HouseVotes84)
na.omit(HouseVotes84)->hvo
Boruta(Class~.,data=hvo,doTrace=2)->Bor.hvo
#> 1. run of importance source...
#> 2. run of importance source...
#> 3. run of importance source...
...
#> 98. run of importance source...
#> 99. run of importance source...
print(Bor.hvo)
#> Boruta performed 99 iterations in 6.091485 secs.
#> 11 attributes confirmed important: V11, V12, V13, V14, V15 and 6 more;
#> 4 attributes confirmed unimportant: V1, V10, V16, V6;
#> 1 tentative attributes left: V2;
plot(Bor.hvo)
Create a custom 'Boruta data cleaning' function:
process_the_Boruta_data <- function(x, whichShadow=c(TRUE,TRUE,TRUE),
                                    colCode=c('green','yellow','red','blue'),
                                    col=NULL) {
  if(is.null(x$ImpHistory))
    stop('Importance history was not stored during the Boruta run.')
  #Removal of -Infs and conversion to a list
  lz <- lapply(1:ncol(x$ImpHistory),
               function(i) x$ImpHistory[is.finite(x$ImpHistory[,i]), i])
  names(lz) <- colnames(x$ImpHistory)
  #Selection of shadow meta-attributes
  numShadow <- sum(whichShadow)
  lz <- lz[c(rep(TRUE, length(x$finalDecision)), whichShadow)]
  generateCol <- function(x, colCode, col, numShadow){
    #Checking arguments
    if(is.null(col) & length(colCode) != 4)
      stop('colCode should have 4 elements.')
    #Generating col
    if(is.null(col)){
      cc <- rep(colCode[4], length(x$finalDecision) + numShadow)
      cc[c(x$finalDecision == 'Confirmed', rep(FALSE, numShadow))] <- colCode[1]
      cc[c(x$finalDecision == 'Tentative', rep(FALSE, numShadow))] <- colCode[2]
      cc[c(x$finalDecision == 'Rejected', rep(FALSE, numShadow))] <- colCode[3]
      col <- cc
    }
    return(col)
  }
  #Generating color vector
  col <- generateCol(x, colCode, col, numShadow)
  #Ordering boxes by attribute median importance
  ii <- order(sapply(lz, stats::median))
  lz <- lz[ii]
  col <- col[ii]
  lz_df <- do.call(rbind.data.frame, lz)
  df <- as.data.frame(t(lz_df))
  names(df) <- names(lz)
  rownames(df) <- NULL
  return(df)
}
# Apply the function:
clean_Bor.hvo <- process_the_Boruta_data(Bor.hvo)
# Plot the data:
library(tidyverse)
clean_Bor.hvo %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = fct_reorder(name, value, median), y = value)) +
  geom_boxplot() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90,
                                   vjust = 0.5))
Created on 2022-08-19 by the reprex package (v2.0.1)
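If you do not need the decision-based colouring, a shorter route is to reshape the ImpHistory matrix directly. This is only a sketch, assuming the Bor.hvo object from the run above; it simply drops any non-finite importances, the same filtering the function above performs:
library(tidyverse)
as.data.frame(Bor.hvo$ImpHistory) %>%
  pivot_longer(everything()) %>%
  filter(is.finite(value)) %>%
  ggplot(aes(x = fct_reorder(name, value, median), y = value)) +
  geom_boxplot() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5))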
I'm trying to enter the data below into a data frame to make a ggplot line graph.
#functions for the hh budget and utility functions
pqxf <- function(y)(1*y) # replace p with price of y
pqyf <- function(x)(-1.25*x)+20 # -1.25 is the wage rate
utilityf <- function(x)80*(1/(x)) # 80 is the utility provided
hours <- c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)

#functions are turned into data frames
pqy <- data.frame("consumption" =
                    pqxf(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
pqx <- data.frame("leisure" =
                    pqxf(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
utility <- data.frame("utility" =
                        utilityf(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))

#each data frame is combined into a single data frame, that will be used for tables and charts
hh <- data.frame(pqx, pqy, utility, hours)
print(hh)
#this shows the utility, and the cost of x and y, in one data frame

library(ggplot2)
ggplot(hh, aes(x=pqx, y=hours))+
  xlim(0,20)+ylim(0,20)+ # limits set for the assignment
  labs(x = "leisure(hours)", y = "consumption(units)")+
  geom_line(aes(x = pqx, y = pqy))+
  geom_line(aes(x = pqx, y = utility))+
  geom_point(aes(x=8, y=10))+ # values of x and y at the tangent point
  geom_hline(yintercept = 10, linetype="dotted")+ # y of tangent point
  geom_vline(xintercept = 8, linetype = "dotted")+ # x of tangent point
  geom_text(label="E", x=8, y=10, hjust=-1, size=2)+
  geom_text(label="-1.25(units/hour)= -w = MRS", x=9, y=2, hjust=.02, size=2)+
  geom_text(label="U=80", x=4, y=19, hjust=1, size=2)
When I run this, I get the following message:
Error in is.finite(x) : default method not implemented for type 'list'
Should I store the data in a different format than a data frame, format my data frame differently, or set up ggplot differently so that it can handle lists?
Try replacing pqx with leisure and pqy with consumption: inside aes() you need to refer to the column names of hh, not to the data-frame objects themselves (which are lists, hence the error).
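For example, a minimal sketch of the corrected call, assuming the hh data frame built above (whose columns are leisure, consumption, utility and hours):
library(ggplot2)
ggplot(hh, aes(x = leisure, y = consumption)) +
  xlim(0, 20) + ylim(0, 20) +
  labs(x = "leisure (hours)", y = "consumption (units)") +
  geom_line() +                      # budget line
  geom_line(aes(y = utility)) +      # utility curve
  geom_hline(yintercept = 10, linetype = "dotted") +
  geom_vline(xintercept = 8, linetype = "dotted")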
Background and Summary of Objective
I am trying to find the y-coordinate at the intersection of two plotted curves using R. I will provide complete details and sample data below, but in the hopes that this is a simple problem, I'll be more concise up front.
The cumulative frequencies of the two curves (c1 and c2 for simplicity) are defined by the following function, where a and b are known coefficients:
f(x)=1/(1+exp(-(a+bx)))
Using the uniroot() function, I found "x" at the intersection of c1 and c2.
I had assumed that if x is known, then determining y should be simple substitution: for example, if x = 10, then y = 1/(1+exp(-(a+b*10))) (again, a and b are known values); however, as shown below, this is not the case.
The objective of this post is to determine how to find the y-coordinate.
Details
This data replicates respondents' stated price at which they find the product's price to be too.cheap (i.e., they question its quality) and the price at which they feel the product is a bargain.
The data will be cleaned before use to ensure that too.cheap is always less than the bargain price.
The cumulative frequency for the bargain price will be inverted to become not.bargain.
The intersection of not.bargain and too.cheap will represent the point at which an equal share of respondents feel the price is not a bargain and too.cheap --- the point of marginal cheapness ("pmc").
Getting to the point where I'm having a challenge will take a number of steps.
Step 1: Generate some data
# load libraries for all steps
library(car)
library(ggplot2)
# function that generates the data
so.create.test.dataset <- function(n, mean){
  step.to.bargain <- round(rnorm(n = n, 3, sd = 0.75), 2)
  price.too.cheap <- round(rnorm(n = n, mean = mean, sd = floor(mean * 100 / 4) / 100), 2)
  price.bargain <- price.too.cheap + step.to.bargain
  df.temp <- cbind(price.too.cheap,
                   price.bargain)
  df.temp <- as.data.frame(df.temp)
  return(df.temp)
}
# create 389 "observations" where the too.cheap has a mean value of 10.50
# the function will also create a "bargain" price by
#adding random values with a mean of 3.00 to the too.cheap price
so.test.df <- so.create.test.dataset(n = 389, mean = 10.50)
Step 2: Create a data frame of cumulative frequencies
so.get.count <- function(p.points, p.vector){
  cc.temp <- as.data.frame(table(p.vector))
  cc.merged <- merge(p.points, cc.temp, by.x = "price.point", by.y = "p.vector", all.x = T)
  cc.extracted <- cc.merged[, "Freq"]
  cc.extracted[is.na(cc.extracted)] <- 0
  return(cc.extracted)
}

so.get.df.price <- function(df){
  # creates cumulative frequencies for three variables
  # using the price points provided by respondents
  # extract and sort all unique price points
  # Thanks to akrun for their help with this step
  price.point <- sort(unique(unlist(round(df, 2))))
  # create a new data frame to work with, having a row for each price point
  dfp <- as.data.frame(price.point)
  # Create cumulative frequencies (as percentages) for each variable
  dfp$too.cheap.share <- 1 - (cumsum(so.get.count(dfp, df$price.too.cheap)) / nrow(df))
  dfp$bargain.share <- 1 - cumsum(so.get.count(dfp, df$price.bargain)) / nrow(df)
  dfp$not.bargain.share <- 1 - dfp$bargain.share # bargain inverted so the curves will intersect
  return(dfp)
}

so.df.price <- so.get.df.price(so.test.df)
so.df.price <- so.get.df.price(so.test.df)
Step 3: Estimate the curves for the cumulative frequencies
# Too Cheap
so.l <- lm(logit(so.df.price$too.cheap.share, percents = TRUE) ~ so.df.price$price.point)
so.cof.TCh <- coef(so.l)
so.temp.nls <- nls(too.cheap.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.TCh[1], b = so.cof.TCh[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.TCh <- predict(so.temp.nls, so.df.price$price.point)

# Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point)
# Thanks to John Fox & Sanford Weisberg - "An R Companion to Applied Regression, second edition"
At this point, we can plot and compare the "observed" cumulative frequencies against the estimated frequencies
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain")) +
  geom_line(aes(y = so.df.price$too.cheap.share, colour = "too.cheap.share")) +
  geom_line(aes(y = so.df.price$not.bargain.share, colour = "not.bargain.share")) +
  scale_y_continuous(name = "Cumulative Frequency")
The estimate appears to fit the observations reasonably well.
Step 4: Find the intersection point for the two estimate functions
so.f <- function(x, a, b){
  # model for the curves
  1 / (1 + exp(-(a + b * x)))
}
# note, this function could also have been used in step 3;
# I was building as I went and didn't want to risk a transposition error that breaks the example
so.pmc.x <- uniroot(function(x) so.f(x, so.cof.TCh[1], so.cof.TCh[2]) - so.f(x, so.cof.NBr[1], so.cof.NBr[2]),
                    c(0, 50), tol = 0.01)$root
We can visually test so.pmc.x by plotting it with the two estimates. If it is correct, a vertical line at so.pmc.x should pass through the intersection of too.cheap and not.bargain.
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain")) +
  scale_y_continuous(name = "Cumulative Frequency") +
  geom_vline(aes(xintercept = so.pmc.x))
...which it does.
Step 5: Find y
Here is where I get stumped, and I'm sure I'm overlooking something very basic.
If a curve is defined by f(x) = 1/(1+exp(-(a+bx))), and a, b and x are all known, then shouldn't y be the result of 1/(1+exp(-(a+bx))) for either estimate?
In this instance, it is not.
# We attempt to use the too.cheap estimate to find y
so.pmc.y <- so.f(so.pmc.x, so.cof.TCh[1], so.cof.TCh[2])
# In theory, y for not.bargain at price.point so.pmc.x should be the same
so.pmc.y2 <- so.f(so.pmc.x, so.cof.NBr[1], so.cof.NBr[2])
EDIT: This is where the error occurs (see solution below).
a != so.cof.NBr[1] and b != so.cof.NBr[2]; instead, a and b should be defined as the coefficients from so.temp.nls (not so.l).
# Which they are
#> so.pmc.y
#(Intercept)
# 0.02830516
#> so.pmc.y2
#(Intercept)
# 0.0283046
If we calculate the correct value for y, a horizontal line at yintercept = so.pmc.y should pass through the intersection of too.cheap and not.bargain.
...which it obviously does not.
So how does one estimate y?
I've solved this, and as I suspected, it was a simple error.
My assumption that y = 1/(1+exp(-(a+bx))) is correct.
The issue is that I was using the wrong a, b coefficients.
My curve was defined using the coefficients in so.cof.NBr as defined by so.l.
#Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))),
                   start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]),
                   data = so.df.price, trace = TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point)
But the resulting curve is so.temp.nls, NOT so.l.
Therefore, once I find so.pmc.x I need to extract the correct coefficients from so.temp.nls and use those to find y.
# extract coefficients from so.temp.nls
so.co <- coef(so.temp.nls)
# find y
so.pmc.y <- 1 / (1 + exp(-(so.co[1] + so.co[2] * so.pmc.x)))
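Equivalently (a small sketch using the not.bargain so.temp.nls fit from above), predict() on the nls object returns the same y without retyping the formula:
# should match so.pmc.y computed above
so.pmc.y.alt <- predict(so.temp.nls, newdata = data.frame(price.point = so.pmc.x))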
ggplot(data = so.df.price, aes(x = price.point)) +
  geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap")) +
  geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain")) +
  scale_y_continuous(name = "Cumulative Frequency") +
  geom_hline(aes(yintercept = so.pmc.y))
Yielding the following plot, which graphically depicts the correct answer.
I am building a plot of Net Present Value (NPV), computed with the FinCal package, against its odds. For the NPV, the cash flows are simulated using a triangular distribution for sales, a normal distribution for costs, and so on. Here is a snippet of what I am doing:
npvCdf <- function(n) {
  N <- sort(n)
  P <- ecdf(N)
  return(P)
}

makePlot <- function(C, m) {
  N <- m$NPV / C$MILLION
  P <- npvCdf(N)
  #
  # NPV distribution curve
  n <- sort(N)
  p <- P(n) * 100
  df <- data.frame(npv = n, odds = p)
  #
  # Points of interest
  o <- C$NPV_BREAK_EVEN_WORST_ODDS
  q <- round((quantile(n, o)), C$DIGITS)
  e <- C$NPV_BREAK_EVEN_VALUE
  b <- P(e) * 100 # THIS IS THE ERROR I CAN'T FIGURE OUT
  w <- o * 100
  s <- getBreakEven(C, m)
  #
  # Labels
  npvOdds <- paste("Odds of break-even : ", b, "%")
  salesThresh <- paste("Sales threshold : ", s)
  worstCase <-
    paste("Worst case (# 5% odds) : ", q, "million")
  #
  # Make plot
  #
  g <- ggplot(df, aes(x = npv, y = odds)) +
    geom_line(colour = "blue") +
    labs(title = "NPV and Odds") +
    labs(x = "NPV (million)") +
    labs(y = "Percent (%)") +
    geom_vline(xintercept = e,
               colour = "red",
               linetype = "longdash") +
    geom_hline(yintercept = b,
               colour = "green",
               linetype = "longdash") +
    geom_vline(xintercept = q,
               colour = "green",
               linetype = "dotdash") +
    geom_hline(yintercept = w,
               colour = "red",
               linetype = "dotdash")
  return(g)
}
C is a data frame of all the constants used for the cash-flow and NPV calculations; for example, C$MILLION = 1000000 is used to divide NPV for simpler presentation. m is a data frame of sales, cash flows and NPV per simulation. The simulations are used for cash flows (triangular distribution), variable cost (normal distribution) and so on.
And, here is the Shiny code that uses the above snippet.
library(shiny)
source("../npd-c.R")

# Define server logic
shinyServer(function(input, output) {
  output$npdPlot <- renderPlot({
    C <- data.frame(2017, 5000, 1000000, 3, 100, 500000, 0.0, 0.05, 0.1,
                    input$salesRange[1],
                    input$salesRange[2],
                    input$salesMode,
                    input$demDeclMean,
                    input$demDeclSd,
                    input$varCostMean,
                    input$varCostSd,
                    input$fixedCostRange[1],
                    input$fixedCostRange[2])
    names(C) <-
      c(
        "SEED",
        "ITERATIONS",
        "MILLION",
        "DIGITS",
        "PRICE",
        "OUTLAY",
        "NPV_BREAK_EVEN_VALUE",
        "NPV_BREAK_EVEN_WORST_ODDS",
        "HURDLE_RATE",
        "SALES_TRIANG_MIN",
        "SALES_TRIANG_MAX",
        "SALES_TRIANG_MODE",
        "DEM_DECL_FACTOR_MEAN",
        "DEM_DECL_FACTOR_SD",
        "VAR_COST_RATE_MEAN",
        "VAR_COST_RATE_SD",
        "FIX_COST_RATE_MIN",
        "FIX_COST_RATE_MAX"
      )
    n <- npd(C, -1)
    g <- makePlot(C, n)
    g
  })
})
The problem is as follows. When the same code is run directly in R, the plot is correct: the NPV curve and the horizontal and vertical lines all line up. When run as a Shiny application, however, the horizontal and vertical lines are hugely displaced. This is despite hiving off the NPV and cash-flow code into a separate .R file and setting the same seed value for both the Shiny and non-Shiny versions. For example, P(0) = 40.07 without Shiny and P(0) = 4.7 with the Shiny application.
What am I missing?
First of all, let me say this is pretty useful code. It is a nice representation of a Monte Carlo simulation using NPV, and I like the plots. It is a post I am pretty sure I will refer back to.
I think I see where the problem is, though: it is basically a matter of misinterpretation plus one small programming error.
The stated problem is that these plots are not showing the same results although they should be. The blue ecdf-NPV curves do look, at first glance, to be the same:
Shiny version:
Stand alone version:
However, if you look carefully, you will see that they are in fact not the same: the expected NPV value (the 50-percent point) in the first case is about 1.5 million, whereas it is only about 0.2 in the second case.
The curves look the same, but they are not. The other point is that there is an error in one of the calculations, which further confuses things. The "Odds of break-even" are incorrectly calculated and are actually the "Odds of losing money".
The correct calculation should be:
b <- (1-P(e)) * 100
And the correct odds of breaking even in the first case would be around 60%, and in the second case around 95%, which matches up with the expected NPV as well.
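To illustrate the distinction with a self-contained sketch (made-up NPV draws, not the poster's simulation): ecdf() returns the share of simulated NPVs at or below a value, so the probability of breaking even is its complement.
set.seed(1)
npv <- rnorm(10000, mean = 0.5, sd = 1) # hypothetical simulated NPVs, in millions
P <- ecdf(npv)
P(0) * 100       # share of simulations that lose money (NPV <= 0)
(1 - P(0)) * 100 # odds of breaking even (NPV > 0)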
I would like to make a barplot of the columns V2 and length. I would also like to plot the standard deviation of length for each group.
> head(Length_filter3)
V1 V2 V3 length
1 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGG 30
2 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGGG 31
3 URS00000081EA snRNA AAACTCGACTGCATAATTTGTGGTAGTGGGGGACT 35
4 URS0000008112A tRNA AAACTCGACTGCATAATTTGTGGTAGTGGGGGACTG 36
5 URS000000812A tRNA AAATGTGGGAAACTCGACTGCATAATTTGTGGTAGTGGGGGACT 44
6 URS0000008121EA tRNA AACTCGACTGCATAATTTGTGGTAGTGGG 29
ggplot(Length_filter3, aes(V2,length)) + geom_bar(stat="identity")
I am assuming that you are looking to create some sort of summary statistic, such as average, rather than trying to plot the total length of all of the RNA types (for which there would be no error bar to speak of).
If it has to be a bar plot, you will likely need to calculate the values yourself. Here, I am manually calculating the ranges I want from the iris data (using dplyr):
library(dplyr)
library(ggplot2)

summarizedData <-
  iris %>%
  group_by(Species) %>%
  summarise(
    mean = mean(Petal.Length)
    , sd = sd(Petal.Length)
    , low = mean + sd/(sqrt(n())) * qt(0.025, n() - 1)
    , high = mean + sd/(sqrt(n())) * qt(0.975, n() - 1)
  )
ggplot(
  summarizedData
  , aes(x = Species
        , y = mean
        , ymax = high
        , ymin = low)
) +
  geom_bar(stat = "identity") +
  geom_linerange()
Alternatively, you can let ggplot do the work for you, particularly if you are willing to use points and error bars instead of a bar plot (I tend to prefer it this way)
ggplot(
  iris
  , aes(x = Species
        , y = Petal.Length)
) +
  stat_summary(fun.data = mean_cl_normal) # mean_cl_normal wraps an Hmisc function, so Hmisc must be installed
You can combine these approaches if you like as well.
Alternatively, try passing a summary function to stat_summary() in ggplot that computes the standard deviation.
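For example, here is a sketch along those lines using the asker's column names (V2 and length). It assumes Length_filter3 is loaded as shown above, a recent ggplot2 (>= 3.3, where fun replaces fun.y), and that the Hmisc package backing the mean_sdl helper is installed:
library(ggplot2)
ggplot(Length_filter3, aes(x = V2, y = length)) +
  stat_summary(fun = mean, geom = "bar") + # bar at the group mean
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
               geom = "errorbar", width = 0.2) # error bars at mean +/- 1 sd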