I am new to R and would like to know how to generate histograms for the following situation:
I initially have a regular frequency table with 2 columns: Column A is the category (or bin) and Column B is the number of cases that fall in that category.
Col A Col B
1-10 7
11-20 4
21-30 5
From this initial frequency table, I create a table with 3 columns : Col A is again the category (or bin), but now Col B is the "fraction of total cases", so for the category 1-10, column B will have the value 7/(7+4+5) = 7/16 . Now there is also a third column, Col C which is "fraction of total cases falling between the categories 1-20", so for 1-10, the value for Col C would be 7/(7+4) = 7/11. The complete table would look like below :
Col A Col B Col C
1-10 7/16 7/11
11-20 4/16 4/11
21-30 5/16 0
How do I generate a histogram from the 3-column table above? My X axis should be the bin (1-10, 11-20 etc.) and my Y axis should be the fraction. However, every bin has two fractions (Col B and Col C), so there will be two fraction "bars" for every bin in the histogram.
Any help would be greatly appreciated.
The data:
dat <- data.frame(A = c("1-10", "11-20", "21-30"), B = c(7, 4, 5))
Now, calculate the proportions and create a new object:
dat2 <- rbind(B = dat$B/sum(dat$B), C = c(dat$B[1:2]/sum(dat$B[1:2]), 0))
colnames(dat2) <- dat$A
Plot:
barplot(dat2, beside = TRUE, legend = rownames(dat2))
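As a sanity check on the fractions themselves, here is a minimal base-R sketch that rebuilds both columns from the raw counts before plotting. The names counts, in_range and fractions are illustrative and do not come from the question.

```r
# Raw counts per bin, as in the question's first table
counts <- c("1-10" = 7, "11-20" = 4, "21-30" = 5)

# Col B: fraction of all cases
frac_total <- counts / sum(counts)

# Col C: fraction of the cases that fall in bins 1-10 and 11-20 only
in_range <- names(counts) %in% c("1-10", "11-20")
frac_1_20 <- ifelse(in_range, counts / sum(counts[in_range]), 0)

# One row per series, one column per bin, ready for a dodged bar chart
fractions <- rbind(B = frac_total, C = frac_1_20)
barplot(fractions, beside = TRUE, legend.text = rownames(fractions),
        ylab = "Fraction of cases")
```

Row B sums to 1 over all three bins, and row C sums to 1 over the first two bins.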
Your title should say "Dodged Bar Chart" rather than "2D histogram": a histogram has a continuous scale on the x axis, unlike a bar chart, and is used to compare the distribution of univariate data, or of univariate data split by a dependent factor. You are trying to compare colB against colC, which can be visualized effectively with a 2D scatter plot but not with a bar chart. A better way to compare the distributions of colB and colC with histograms is to plot two histograms separately and check how the location of the data points changes.
If you want to compare the distributions of colB and colC, try the following code. I rounded the values up to get reasonable data matching your description. Note that the code samples at random, so every time you run it the distribution changes slightly, but that does not affect the comparison between colB and colC.
library("ggplot2")
# 44 datapoints between 1-10
a <- rep(1:10, 4)
a <- c(a, sample(a, size=4, replace=FALSE))
# 25 datapoints between 11-20
b <- rep(11:20, 2)
b <- c(b, sample(b, size=5, replace=FALSE))
# 31 datapoints between 21-30
c <- rep(21:30, 3)
c <- c(c, sample(c, size=1, replace=FALSE))
colB <- c(a, b, c)
# 64 datapoints between 1-10
a <- rep(1:10, 6)
a <- c(a, sample(a, size=4, replace=FALSE))
# 36 datapoints between 11-20
b <- rep(11:20, 3)
b <- c(b, sample(b, size=6, replace=FALSE))
colC <- c(a, b)
df <- data.frame(cbind(colB, colC=colC))
write.table(df, file = "data")
data <- read.table("data", header=TRUE)
data
ggplot(data=data, aes(x=colB, xmin=1, xmax=30)) + stat_bin(binwidth = 1)
ggplot(data=data, aes(x=colC, xmin=1, xmax=30)) + stat_bin(binwidth = 1)
# if you want density distribution, then you can try something like this:
ggplot(data=data, aes(x=colB, y = ..density.., xmin=1, xmax=30)) + stat_bin(binwidth = 1)
ggplot(data=data, aes(x=colC, y = ..density.., xmin=1, xmax=30)) + stat_bin(binwidth = 1)
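If the run-to-run variation is unwanted, fixing the random seed before sampling makes the draws repeatable. A minimal sketch, assuming an arbitrary seed value of 123 (not part of the code above):

```r
# Fixing the seed makes sample() return the same draw on every run.
set.seed(123)  # 123 is an arbitrary choice
first_draw <- sample(rep(1:10, 4), size = 4, replace = FALSE)

set.seed(123)  # reset to the same seed before drawing again
second_draw <- sample(rep(1:10, 4), size = 4, replace = FALSE)

# Both draws are the same four values in the same order
identical(first_draw, second_draw)
```

Calling set.seed() once at the top of the script above would have the same effect on every sample() call that follows.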
HTH
-Sathish
I would like to create a percentage stacked barplot in Julia. In R we may do the following:
set.seed(7)
data <- matrix(sample(1:30,6), nrow=3)
colnames(data) <- c("A","B")
rownames(data) <- c("V1","V2","V3")
library(RColorBrewer)
cols <- brewer.pal(3, "Pastel1")
df_percentage <- apply(data, 2, function(x){x*100/sum(x,na.rm=T)})
barplot(df_percentage, col=cols, border="white", xlab="group")
Created on 2022-12-29 with reprex v2.0.2
I am now able to create the axis in percentages, but not to make it stacked and percentage for each stacked bar like above. Here is some reproducible code:
using StatsPlots
measles = [38556, 24472]
mumps = [20178, 23536]
chickenPox = [37140, 32169]
ticklabel = ["A", "B"]
foo = @. measles + mumps + chickenPox
my_range = LinRange(0, maximum(foo), 11)
groupedbar(
[measles mumps chickenPox],
bar_position = :stack,
bar_width=0.7,
xticks=(1:2, ticklabel),
yticks=(my_range, 0:10:100),
label=["measles" "mumps" "chickenPox"]
)
Output:
This is almost what I want. So I was wondering if anyone knows how to make a stacked percentage barplot like above in Julia?
You just need to change the maximum threshold of the LinRange to be fitted to the maximum value of bars (which is 1 in this case), and change the input data for plotting to be the proportion of each segment:
my_range = LinRange(0, 1, 11)
foo = @. measles + mumps + chickenPox
groupedbar(
[measles./foo mumps./foo chickenPox./foo],
bar_position = :stack,
bar_width=0.7,
xticks=(1:2, ["A", "B"]),
yticks=(my_range, 0:10:100),
label=["measles" "mumps" "chickenPox"],
legend=:outerright
)
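For comparison with the R snippet in the question, the same normalization can be sketched in base R with prop.table(), using the disease counts from the Julia code. The matrix layout and the names counts and shares are assumptions made for this illustration.

```r
# Same counts as in the Julia example: rows are diseases, columns are groups
counts <- rbind(measles    = c(38556, 24472),
                mumps      = c(20178, 23536),
                chickenPox = c(37140, 32169))
colnames(counts) <- c("A", "B")

# Divide every column by its own total, so each bar stacks to 1
shares <- prop.table(counts, margin = 2)

# Stacked bars on a 0-100 percent scale
barplot(100 * shares, ylab = "percent", legend.text = rownames(shares))
```

This is the step that `measles./foo` and the other two divisions perform in the Julia version: each segment is divided by the total of its bar.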
If you want to have the percentages on each segment, then you can use the following function:
function percentages_on_segments(data)
first_phase = permutedims(data)[end:-1:1, :]
a = [0 0;first_phase]
b = accumulate(+, 0.5*(a[1:end-1, :] + a[2:end, :]), dims=1)
c = vec(b)
annotate!(
repeat(1:size(data, 1), inner=size(data, 2)),
c,
["$(round(100*item, digits=1))%" for item=vec(first_phase)],
:white
)
end
percentages_on_segments([measles./foo mumps./foo chickenPox./foo])
Note that [measles./foo mumps./foo chickenPox./foo] is the same data that I passed to the groupedbar function.
I have a dataset with 7 columns and want to plot each of the last 6 against the first one. Doing it for a single column works, but apparently I am missing something when looping.
Here is my dataframe:
colnames(df) <- c("real", "est1", "est2", "est3", "est4", "est5", "est6")
head(df)
real est1 est2 est3 est4 est5 est6
1 6 1.040217e-05 7.693853e-05 0.0006782929 0.002676282 0.033385059 0.9631730251
2 6 1.065455e-05 7.880501e-05 0.0006947352 0.002740934 0.034161665 0.9623132055
3 5 1.037427e-03 7.607541e-03 0.0624143732 0.185340034 0.536009785 0.2075908392
4 1 2.345527e-01 4.855757e-01 0.2374464964 0.032691816 0.008846185 0.0008870667
5 5 3.506084e-04 2.585847e-03 0.0222474072 0.079120851 0.458854341 0.4368409455
6 3 1.710639e-03 1.247417e-02 0.0978889632 0.250555703 0.500355545 0.1370149767
and the code
boxplot( est1 ~ real, data=df, main="Estimated Probability for Category 1 Given the Real Categories")
works fine, but if I do exactly the same in a loop, it doesn't:
looper <- c("est1", "est2", "est3", "est4", "est5", "est6") #to get the column names to loop over
counter <- 0 # for the boxplot's title
for (i in all_of(looper)){
counter <- counter +1
boxplot( i ~ real, data=df,
main = paste("Estimated Probability for Category",counter,"Given the Real Categories")
)
}
I suppose it has to do with the way i is used. I also tried "i" and i wrapped in backticks, and I always get one of the following errors:
For i: Error in stats::model.frame.default(formula = i ~ real, data = eval.m1_sim) :
variable lengths differ (found for 'real')
For "i" or `i`: Error in terms.formula(formula, data = data) :
invalid term in model formula
What am I missing?
You could go via column numbers:
# random example data as no reproducible example was given
df <- data.frame(
real = sample(1:4, 20, TRUE),
one = runif(20),
two = runif(20),
three = runif(20)
)
# graphics parameters so we see all at once
par(mfrow = c(3,1), mar = c(2, 2, 1, 1))
# the easiest way is through column numbers
for(column in 2:4)
boxplot(df[[column]] ~ df$real)
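The loop in the question fails because i holds a string such as "est1", so i ~ real is a formula on the literal symbol i rather than on the column est1. If you would rather keep looping over column names, one option is to build each formula with reformulate(). A minimal sketch on made-up data (the two-column data frame is only for illustration):

```r
# Made-up data with the same column naming as in the question
set.seed(1)
df <- data.frame(real = sample(1:4, 20, TRUE),
                 est1 = runif(20),
                 est2 = runif(20))

looper <- c("est1", "est2")
par(mfrow = c(length(looper), 1), mar = c(2, 2, 2, 1))

for (i in seq_along(looper)) {
  # reformulate() turns the strings into the formula est1 ~ real, est2 ~ real, ...
  f <- reformulate("real", response = looper[i])
  boxplot(f, data = df,
          main = paste("Estimated Probability for Category", i,
                       "Given the Real Categories"))
}
```

Looping over seq_along(looper) also gives the counter for the title, so no separate counter variable is needed.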
Another option:
library(tidyverse)
df %>%
pivot_longer(-real) %>%
mutate(real = factor(real)) %>%
ggplot(aes(real, value)) +
geom_boxplot() +
facet_wrap(~name)
Imagine I have data foo below. Each row contains a measurement (y) on a species and each species is paired with another (species.pair). So in the example below, species a is paired with e, b with f, and so on. The number of observations for each species varies. I'd like to plot the density of each species's distribution along with its partner's distribution in its own facet. Below I hand coded this with the column sppPairs. The species are all unique and each has a match in species.pair. I'm unsure of how to make the grouping column sppPairs below. I'm sure there is some clever way to do this with {dplyr} but I can't figure out what to do. Some kind of pasting species to species.pair I imagine? Any help much appreciated.
foo <- data.frame(species = rep(letters[1:8],each=10),
species.pair = rep(letters[c(5:8,1:4)],each=10),
y=rnorm(80))
# species and species pair match exactly
all(unique(foo$species) %in% unique(foo$species.pair))
# what I want
foo$sppPairs <- c(rep("a:e",10),
rep("b:f",10),
rep("c:g",10),
rep("d:h",10),
rep("a:e",10),
rep("b:f",10),
rep("c:g",10),
rep("d:h",10))
p1 <- ggplot(foo,aes(y,fill=species))
p1 <- p1 + geom_density(alpha=0.5)
p1 <- p1 + facet_wrap(~sppPairs)
p1
Yes, you can use apply on the appropriate columns to paste the sorted elements together in the correct order (otherwise a:e is different from e:a and so on, and you end up with 8 groups instead of 4):
library(ggplot2)
foo <- data.frame(species = rep(letters[1:8], each = 10),
species.pair = rep(letters[c(5:8, 1:4)], each = 10),
y = rnorm(80))
foo$sppPairs <- apply(foo[c("species", "species.pair")], 1,
function(x) paste(sort(x), collapse = ":"))
ggplot(foo, aes(y, fill = species)) +
geom_density(alpha = 0.5) +
facet_wrap(~sppPairs)
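A vectorized alternative to the row-wise apply() is to take the alphabetically smaller and larger name of each pair with pmin() and pmax(), which also work on character vectors. This is a sketch under the assumption that both columns can be treated as plain character vectors, hence the as.character() calls:

```r
foo <- data.frame(species = rep(letters[1:8], each = 10),
                  species.pair = rep(letters[c(5:8, 1:4)], each = 10),
                  y = rnorm(80))

# as.character() guards against factor columns (the default before R 4.0)
s1 <- as.character(foo$species)
s2 <- as.character(foo$species.pair)

# Smaller name first, larger name second, so a:e and e:a collapse to a:e
foo$sppPairs <- paste(pmin(s1, s2), pmax(s1, s2), sep = ":")

table(foo$sppPairs)
```

On a data frame with many rows this avoids calling sort() once per row.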
Created on 2020-10-05 by the reprex package (v0.3.0)
Background and Summary of Objective
I am trying to find the y-coordinate at the intersection of two plotted curves using R. I will provide complete details and sample data below, but in the hopes that this is a simple problem, I'll be more concise up front.
The cumulative frequencies of two curves (c1 and c2 for simplicity) are defined by the following function, where a and b are known coefficients:
f(x)=1/(1+exp(-(a+bx)))
Using the uniroot() function, I found "x" at the intersection of c1 and c2.
I had assumed that if x is known then determining y should be simple substitution: for example, if x = 10, y=1/(1+exp(-(a+b*10))) (again, a and b are known values); however, as will be shown below, this is not the case.
The objective of this post is to determine how to find y-coordinate.
Details
This data replicates respondents' stated price at which they find the product's price to be too.cheap (i.e., they question its quality) and the price at which they feel the product is a bargain.
The data will be cleaned before use to ensure that too.cheap is always less than the bargain price.
The cumulative frequency for the bargain price will be inverted to become not.bargain.
The intersection of not.bargain and too.cheap will represent the point at which equal shares of respondents feel the price is not a bargain and too cheap: the point of marginal cheapness ("pmc").
Getting to the point where I'm having a challenge will take a number of steps.
Step 1: Generate some data
# load libraries for all steps
library(car)
library(ggplot2)
# function that generates the data
so.create.test.dataset <- function(n, mean){
step.to.bargain <- round(rnorm(n = n, 3, sd = 0.75), 2)
price.too.cheap <- round(rnorm(n = n, mean = mean, sd = floor(mean * 100 / 4) / 100), 2)
price.bargain <- price.too.cheap + step.to.bargain
df.temp <- cbind(price.too.cheap,
price.bargain)
df.temp <- as.data.frame(df.temp)
return(df.temp)
}
# create 389 "observations" where the too.cheap has a mean value of 10.50
# the function will also create a "bargain" price by
#adding random values with a mean of 3.00 to the too.cheap price
so.test.df <- so.create.test.dataset(n = 389, mean = 10.50)
Step 2: Create a data frame of cumulative frequencies
so.get.count <- function(p.points, p.vector){
cc.temp <- as.data.frame(table(p.vector))
cc.merged <- merge(p.points, cc.temp, by.x = "price.point", by.y = "p.vector", all.x = T)
cc.extracted <- cc.merged[,"Freq"]
cc.extracted[is.na(cc.extracted)] <- 0
return(cc.extracted)
}
so.get.df.price<-function(df){
# creates cumulative frequencies for three variables
# using the price points provided by respondents
# extract and sort all unique price points
# Thanks to akrun for their help with this step
price.point <- sort(unique(unlist(round(df, 2))))
#create a new data frame to work with having a row for each price point
dfp <- as.data.frame(price.point)
# Create cumulative frequencies (as percentages) for each variable
dfp$too.cheap.share <- 1 - (cumsum(so.get.count(dfp, df$price.too.cheap)) / nrow(df))
dfp$bargain.share <- 1 - cumsum(so.get.count(dfp, df$price.bargain)) / nrow(df)
dfp$not.bargain.share <- 1 - dfp$bargain.share# bargain inverted so curves will intersect
return(dfp)
}
so.df.price <- so.get.df.price(so.test.df)
Step 3: Estimate the curves for the cumulative frequencies
# Too Cheap
so.l <- lm(logit(so.df.price$too.cheap.share, percents = TRUE)~so.df.price$price.point)
so.cof.TCh <- coef(so.l)
so.temp.nls <- nls(too.cheap.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.TCh[1], b = so.cof.TCh[2]), data = so.df.price, trace = TRUE)
so.df.price$Pr.TCh <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
#Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]), data= so.df.price, trace=TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
# Thanks to John Fox & Sanford Weisberg - "An R Companion to Applied Regression, second edition"
At this point, we can plot and compare the "observed" cumulative frequencies against the estimated frequencies
ggplot(data = so.df.price, aes(x = price.point))+
geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap"))+
geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain"))+
geom_line(aes(y = so.df.price$too.cheap.share, colour = "too.cheap.share"))+
geom_line(aes(y = so.df.price$not.bargain.share, colour = "not.bargain.share"))+
scale_y_continuous(name = "Cumulative Frequency")
The estimate appears to fit the observations reasonably well.
Step 4: Find the intersection point for the two estimate functions
so.f <- function(x, a, b){
# model for the curves
1 / (1 + exp(-(a + b * x)))
}
# note, this function may also be used in step 3
#I was building as I went and I don't want to risk a transpositional error that breaks the example
so.pmc.x <- uniroot(function(x) so.f(x, so.cof.TCh[1], so.cof.TCh[2]) - so.f(x, so.cof.NBr[1], so.cof.NBr[2]), c(0, 50), tol = 0.01)$root
We may visually test the so.pmc.x by plotting it with the two estimates. If it is correct, a vertical line for so.pmc.x should pass through the intersection of too.cheap and not.bargain.
ggplot(data = so.df.price, aes(x = price.point)) +
geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap")) +
geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain")) +
scale_y_continuous(name = "Cumulative Frequency") +
geom_vline(aes(xintercept = so.pmc.x))
...which it does.
Step 5: Find y
Here is where I get stumped, and I'm sure I'm overlooking something very basic.
If a curve is defined by f(x) = 1/(1+exp(-(a+bx))), and a, b and x are all known, then shouldn't y be the result of 1/(1+exp(-(a+bx))) for either estimate?
In this instance, it is not.
# We attempt to use the too.cheap estimate to find y
so.pmc.y <- so.f(so.pmc.x, so.cof.TCh[1], so.cof.TCh[2])
# In theory, y for not.bargain at price.point so.pmc.x should be the same
so.pmc.y2 <- so.f(so.pmc.x, so.cof.NBr[1], so.cof.NBr[2])
EDIT: This is where the error occurs (see solution below).
a != so.cof.NBr[1] and b != so.cof.NBr[2]; instead, a and b should be defined as the coefficients from so.temp.nls (not so.l).
# Which they are
#> so.pmc.y
#(Intercept)
# 0.02830516
#> so.pmc.y2
#(Intercept)
# 0.0283046
If we calculate the correct value for y, a horizontal line at yintercept = so.pmc.y, should pass through the intersection of too.cheap and not.bargain.
...which it obviously does not.
So how does one estimate y?
I've solved this, and as I suspected, it was a simple error.
My assumption that y = 1/(1+exp(-(a+bx))) is correct.
The issue is that I was using the wrong a, b coefficients.
My curve was defined using the coefficients in so.cof.NBr as defined by so.l.
#Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]), data= so.df.price, trace=TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
But the resulting curve is so.temp.nls, NOT so.l.
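To see why the two sets of coefficients differ, here is a small sketch on simulated data (not the survey data above). It fits a straight line to the logit-transformed shares with lm(), then refines the fit with nls() on the original scale. The true values 8 and -0.8, the noise level and the clipping bounds are all assumptions chosen for illustration; qlogis() and plogis() are the base-R logit and inverse logit.

```r
set.seed(1)
x <- seq(5, 20, by = 0.5)

# Noisy logistic shares around a = 8, b = -0.8
y <- plogis(8 - 0.8 * x) + rnorm(length(x), sd = 0.01)
y <- pmin(pmax(y, 0.001), 0.999)  # keep shares strictly inside (0, 1)

# Linear fit on the logit scale: only used for starting values
fit_lm <- lm(qlogis(y) ~ x)
start <- list(a = coef(fit_lm)[[1]], b = coef(fit_lm)[[2]])

# Nonlinear least squares on the original scale: this defines the plotted curve
fit_nls <- nls(y ~ 1 / (1 + exp(-(a + b * x))), start = start)

# Two different coefficient pairs, hence two different curves
rbind(lm = coef(fit_lm), nls = coef(fit_nls))
```

The lm() coefficients only seed the optimizer; any value computed from the fitted curve has to use coef(fit_nls).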
Therefore, once I find so.pmc.x I need to extract the correct coefficients from so.temp.nls and use those to find y.
# extract coefficients from so.temp.nls
so.co <- coef(so.temp.nls)
# find y
so.pmc.y <- 1 / (1 + exp(-(so.co[1] + so.co[2] * so.pmc.x)))
ggplot(data = so.df.price, aes(x = price.point))+
geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap"))+
geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain"))+
scale_y_continuous(name = "Cumulative Frequency")+
geom_hline(aes(yintercept = so.pmc.y))
Yielding the following...
which graphically depicts the correct answer.
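The whole idea can be checked on a toy case with known coefficients. The sketch below uses two hypothetical coefficient pairs (not values from the fitted models), finds the crossing with uniroot(), and confirms that both curves give the same y there. For this model the crossing also has a closed form, since the curves meet where a1 + b1 * x = a2 + b2 * x.

```r
logistic <- function(x, a, b) 1 / (1 + exp(-(a + b * x)))

# Hypothetical coefficients: curve 1 falls with x, curve 2 rises with x
co1 <- c(a = 8, b = -0.8)
co2 <- c(a = -10, b = 0.7)

# Difference between the two curves; its root is the crossing point
gap <- function(x) {
  logistic(x, co1[["a"]], co1[["b"]]) - logistic(x, co2[["a"]], co2[["b"]])
}
x_star <- uniroot(gap, interval = c(0, 50), tol = 1e-9)$root

# Substituting x into either curve gives the same y,
# provided the coefficients are the ones that define the curves
y1 <- logistic(x_star, co1[["a"]], co1[["b"]])
y2 <- logistic(x_star, co2[["a"]], co2[["b"]])

# Closed form for the crossing: x = (a2 - a1) / (b1 - b2)
x_closed <- (co2[["a"]] - co1[["a"]]) / (co1[["b"]] - co2[["b"]])

c(x_star = x_star, x_closed = x_closed, y1 = y1, y2 = y2)
```

Applied to the models above, co1 and co2 would be the coef() vectors of the two nls fits, which requires storing each fit under its own name instead of reusing so.temp.nls.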
I am trying to label a PCA biplot with sample names rather than the standard numbers. I am using the following code:
PRCOMP1 <- prcomp(~ Max + Min + Range + Average + P10 + P20 +
P50 + P100 + D10 + D20 + D50 + D100 + D500,
data = turbidity,
na.action = na.omit,
scale = TRUE)
biplot(PRCOMP1, cex = 0.8, choices=c(1,2))
which provides the plot below. Does anyone know how I can label the points with a column labelled Sample in my datasheet?
Also, is there an easy way to change the colour of the arrows? Any help would be much appreciated.
You would name the rows of your input data with the value of the Sample column:
row.names(turbidity) <- turbidity$Sample
The dots on your biplot will then be labelled with their cognate sample name.
Here is an example:
#creating an example data frame with 5 numeric and one character variables
mydata1 <- as.data.frame(matrix(rnorm(100, 0, 2), ncol = 5))
mydata1$sample <- c(sapply(1:20, function(i) paste("s", i, sep = "")))
#view of the df
mydata1
V1 V2 V3 V4 V5 sample
1 1.7398057 -0.8074246 0.009826488 0.58566480 3.88569625 s1
2 -1.3259889 -2.4359229 -1.258855445 2.65124987 -2.64137545 s2
3 -2.3961068 -0.3108402 -1.330362255 -0.35209302 -2.39282594 s3
This is a 20 rows by 6 variables dataframe
biplot(prcomp(mydata1[,-6]))
This statement will return a plot without the sample labels, showing only row numbers.
#naming rows of the df with the sample column value
row.names(mydata1) <- mydata1$sample
#viewing the df
head(mydata1)
V1 V2 V3 V4 V5 sample
s1 1.739806 -0.8074246 0.009826488 0.5856648 3.8856962 s1
s2 -1.325989 -2.4359229 -1.258855445 2.6512499 -2.6413755 s2
s3 -2.396107 -0.3108402 -1.330362255 -0.3520930 -2.3928259 s3
#plotting
biplot(prcomp(mydata1[,-6]))
The latter plot will now render the observations with their labels.
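On the second part of the question, the arrow colour: biplot() accepts a col argument of length 2, where the first colour is used for the observations and the second for the variables and their arrows. A minimal sketch on simulated data, with black and red as arbitrary choices:

```r
# Simulated data with sample names as row names
set.seed(1)
mydata <- as.data.frame(matrix(rnorm(100, 0, 2), ncol = 5))
row.names(mydata) <- paste0("s", 1:20)

pca <- prcomp(mydata, scale. = TRUE)

# First colour: observation labels; second colour: variable labels and arrows
biplot(pca, cex = 0.8, col = c("black", "red"))
```

The row names carry through to pca$x, which is where biplot() takes the observation labels from.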
Let me know if that is what you had in mind.