Group results under data names with sapply function and plot in R - r

Let's say I have a data set that consists of 3 separate data frames. Here is my data:
data<-structure(list(x = structure(list(value = c(2L, 4L, 5L, 6L, 9L,
4L, 3L, 2L, 10L, 6L)), .Names = "value", class = "data.frame", row.names = c(NA,
-10L)), y = structure(list(value = c(2, 2.1, 4, 3, 0, 1.2, 4.2,
3, 4, 9)), .Names = "value", class = "data.frame", row.names = c(NA,
-10L)), z = structure(list(value = c(1, 2, 7, 6, 0.3, 5.4, 4,
3, 6, 7)), .Names = "value", class = "data.frame", row.names = c(NA,
-10L))), .Names = c("x", "y", "z"))
And here is my sample function:
sam<-function(x) {
ex<-c(3,2,4,5,2)
z<-data.frame(x)
y<-as.matrix(sapply(z, as.numeric))
h<-lapply(c(2,5,10), function(xx) tapply(y, as.integer(gl(nrow(x), xx, nrow(x)) ), FUN = sum))
names(h)<-c("min2", "min5", "min10")
min2<-h[[1]]
pdf("plots.pdf")
plot(min2, ex, main="min. compare",
xlab="Historical Values ", ylab="Disaggregated Values", pch=19, col = "blue")
dev.off()
return(h)
}
In the function, I aggregate the values as shown and then plot min2 against ex.
With the code below, I tried to apply the function to all of the data:
v1<-sapply(data, sam)
But I cannot see the calculation names (min2, min5, min10) in the resulting list, and the results come back jumbled rather than grouped under x, y and z.
I want these two things:
1) Grouping the results under each data name, like this:
[x] [y] [z]
min2 min2 min2
min5 min5 min5
min10 min10 min10
2) Plotting the desired results for each of x, y and z as described above, and exporting the three plots either to one PDF or to separate files.

To get the output you want, use lapply instead of sapply, as @JonnyPhelps suggested. To make the plots and get the correlation, you need to alter the function:
sam<-function(x) {
ex<-c(3,2,4,5,2)
z<-data.frame(x)
y<-as.matrix(sapply(z, as.numeric))
h<-lapply(c(2,5,10), function(xx) tapply(y, as.integer(gl(nrow(x), xx, nrow(x)) ), FUN = sum))
names(h)<-c("min2", "min5", "min10")
min2<-h[[1]]
plot(min2, ex, main="min. compare",
xlab="Historical Values ", ylab="Disaggregated Values",
pch=19, col = "blue")
COR = cor.test(min2,ex)
LABEL = paste("cor=",signif(COR$estimate,3),"\np=",signif(COR$p.value,3))
mtext(LABEL,side=3,padj=2)
return(h)
}
The correlation is computed with cor.test, and mtext places it at the top of the plot. You can adjust padj and adj to position the text where you need it.
In your original function, pdf() and dev.off() were called inside the function, so the file was overwritten on every call. To put all the plots in one PDF, open the device once around the lapply call:
pdf("plots.pdf")
v1<-lapply(data, sam)
dev.off()
Or if you want them on the same page:
pdf("plots.pdf",width=8,height=4)
par(mfrow=c(1,3))
v1<-lapply(data, sam)
dev.off()
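To see why lapply gives the grouping asked for, here is a minimal sketch. It assumes a plot-free stand-in for sam; the helper name block_sums and its sizes argument are hypothetical and not part of the original code. sapply tries to simplify its result, which here collapses the three lists into a list-matrix, whereas lapply keeps one named list per data set.

```r
# Toy input mirroring the question: a named list of one-column data frames
data <- list(
  x = data.frame(value = c(2, 4, 5, 6, 9, 4, 3, 2, 10, 6)),
  y = data.frame(value = c(2, 2.1, 4, 3, 0, 1.2, 4.2, 3, 4, 9)),
  z = data.frame(value = c(1, 2, 7, 6, 0.3, 5.4, 4, 3, 6, 7))
)

# Hypothetical helper: sums over consecutive blocks of 2, 5 and 10 rows
block_sums <- function(df, sizes = c(2, 5, 10)) {
  v <- as.numeric(df$value)
  out <- lapply(sizes, function(k) {
    grp <- (seq_along(v) - 1) %/% k + 1  # block index for each row
    tapply(v, grp, sum)
  })
  names(out) <- paste0("min", sizes)
  out
}

# lapply keeps the outer names (x, y, z) and the inner names (min2, min5, min10)
v1 <- lapply(data, block_sums)
str(v1)

# Individual results are then reachable by name
v1$x$min2
```

With this layout, v1$y$min5 or v1[["z"]][["min10"]] picks out a single aggregation for a single data set.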

Related

How to plot many probability density functions (pdfs) without sharp edges?

I am having trouble plotting continuous distributions without sharp edges in ggplot2. I need to show two of them on one plot. The solution does not have to use ggplot2.
I am aware that the number of data points directly influences the smoothness of the line, but that did not seem to be the cause here. Below is some sample data (from dput):
sample.data<-list(beta.data = structure(list(cluster = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), beta.density = c(0, 3.42273368363934e-43, 8.42987148403021e-29,
2.04764468657484e-20, 1.69485562831516e-14, 6.07999638837842e-10, 2.88180370232676e-06, 0.00314064636750876, 0.954118897015866, 0, 0, 3.80101893822358e-36, 6.43342582657081e-22, 6.82956252277493e-14, 1.75291058931833e-08, 0.000131874335695378, 0.0750918340641428, 3.72532418759802, 5.05242078519544, 0), pr = c(0, 0.111111111111111, 0.222222222222222, 0.333333333333333, 0.444444444444444, 0.555555555555556, 0.666666666666667, 0.777777777777778, 0.888888888888889, 1, 0, 0.111111111111111, 0.222222222222222, 0.333333333333333, 0.444444444444444, 0.555555555555556, 0.666666666666667, 0.777777777777778, 0.888888888888889, 1)), row.names = c(NA, -20L), class = "data.frame"), beta.params = structure(list(cluster = 1:2, a = c(49, 50), b = c(2, 10), ni.beta = c(0.961,0.833), sd.beta = c(0.00072, 0.00228)), row.names = c(NA,-2L), class = "data.frame"))
Before I was using geom_col, but it discretizes values. I went with geom_area:
library(ggplot2)
ggplot(sample.data$beta.data, aes(x = pr, y = beta.density)) +
geom_area(stat = "function",
fun = dbeta,
args = list(shape1 = sample.data$beta.params[1,"a"], shape2 = sample.data$beta.params[1,"b"]),
fill = "#F1C40F",
colour = "black",
alpha = 0.7) +
geom_area(stat = "function",
fun=dbeta,
args = list(shape1 = sample.data$beta.params[2,"a"], shape2 = sample.data$beta.params[2,"b"]),
fill = "#3498DB",
colour = "black",
alpha = 0.7)
I have shown the data with 10 points, but 1000 points look almost the same. That is not the case with the base plot below, where even 100 points look fine:
p = seq(0,1, length=100)
plot(p, dbeta(p, 50, 10), ylab="Density", type ="l", col=4, lwd = 2)
Here is code to simulate the data. Note that the troublesome beta parameters were a = 49 and b = 2.
len <- 100
p <- seq(0,1, length.out = len)
df <- data.frame(rbind(cbind("cl" = rep(1, times = length(p)), "beta" = dbeta(p, 50, 10),"p"= p),
cbind("cl" = rep(2, times = length(p)), "beta" = dbeta(p, 49, 2),"p"= p)))
Do you have any ideas?
EDIT: "pdfs" here stands for probability density functions. That is why I have not used "pdf" as a tag. My apologies for the confusion!
That said, when I printed the graphic to a PDF file, the result was poor as well (sharp edges). In the end it shouldn't matter: I want to see smooth lines whatever (reasonable) output I use.
EDIT 2: This must be achievable, because:
library(mosaic)
theme_set(theme_bw())
xpbeta(c(0.7, 0.90), shape1 = 49, shape2 = 2)
This produces a nice, smooth beta distribution with parameters (49, 2). But then again, I need to show two distributions in one chart.
I have found an answer. It still needs some work (such as transparency/alpha, which I couldn't figure out), but in general this is what I meant. Code:
library(mosaic)
plotDist('beta', params=list(49,2), kind='density', type = "h", col = "#3498DB", xlim = c(0,1))
plotDist('beta', params=list(50, 10), kind='density', type = "h", col = "#F1C40F", add = TRUE)
plotDist('beta', params=list(49,2), kind='density', add = TRUE, col = "black")
plotDist('beta', params=list(50, 10), kind='density', add = TRUE, col = "black")
Result:
We can add as many distributions as we want using the add parameter.
The parameter type = "h" draws a filled distribution; without it, only the outline is visible. In my answer, I draw two outlines and two filled distributions. I would be really happy if someone could show a better answer, though.
EDIT:
I think I found my perfect answer!
Here is the code:
library(ggformula)
theme_set(theme_bw())
gf_dist("beta", shape1 = 49, shape2 = 2, geom = "area", alpha = 0.5, fill = "#F1C40F") %>%
gf_dist("beta", shape1 = 49, shape2 = 2) %>%
gf_dist("beta", shape1 = 50, shape2 = 10, geom = "area", alpha = 0.5, fill = "#3498DB") %>%
gf_dist("beta", shape1 = 50, shape2 = 10)
It is much faster than the previous code, the alpha parameter works as expected, and it is relatively easy to combine many plots. Because of the transparency, you can clearly see where the two distributions overlap.
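If extra packages are not an option, one base R sketch is to evaluate dbeta on a fine grid yourself and fill the area with polygon. The grid size of 2001 points and the use of adjustcolor for transparency are my own assumptions, not part of the answer above. The reason a fine grid matters is that Beta(49, 2) packs almost all of its mass very close to 1, so a coarse grid can miss the peak entirely.

```r
# Fine grid on [0, 1]; the size is an assumption, tune it as needed
p <- seq(0, 1, length.out = 2001)
d1 <- dbeta(p, 49, 2)   # the sharply peaked distribution
d2 <- dbeta(p, 50, 10)

# For comparison: the same density on the 10-point grid from the sample data
coarse_max <- max(dbeta(seq(0, 1, length.out = 10), 49, 2))
fine_max <- max(d1)
c(coarse = coarse_max, fine = fine_max)

# Empty canvas sized to hold both curves
plot(p, d1, type = "n", xlab = "pr", ylab = "Density",
     ylim = range(0, d1, d2))

# Filled, semi-transparent areas with black outlines
polygon(c(p, rev(p)), c(d1, rep(0, length(p))),
        col = adjustcolor("#F1C40F", alpha.f = 0.7), border = "black")
polygon(c(p, rev(p)), c(d2, rep(0, length(p))),
        col = adjustcolor("#3498DB", alpha.f = 0.7), border = "black")
```

The printed comparison shows how far the 10-point grid undershoots the true peak, which is one plausible source of the jagged look.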

How to get a data frame into a graph, when dealing with words and numbers?

This is my data frame
id product cost
1 Milk 3
2 egg 2
3 coffee 4
4 tea 2
5 sugar 3
I am trying to work out how to get each product and its cost into a graph. My current issue is that plotting fails because the product column is not numeric.
Thanks!
Based on your comment, you can use ggplot2 to create a barplot. Like this:
library(ggplot2)
ggplot(data = df, aes(x = product, y = cost)) +
geom_bar(stat = "identity") + coord_flip()
data
df <- structure(list(id = c(1, 2, 3, 4, 5), product = structure(c(3L, 2L, 1L, 5L, 4L), .Label = c("coffee", "egg", "Milk", "sugar", "tea"), class = "factor"), cost = c(3, 2, 4, 2, 3)), class = "data.frame", row.names = c(NA, -5L))
Alternatively, in base R, convert the "product" column to a factor and plot it, like so:
data <- data.frame(id = c(1, 2, 3, 4, 5),
product = c("Milk", "egg", "coffee", "tea", "sugar"),
cost = c(3, 2, 4, 2, 3)
)
plot(x = as.factor(data$product),
y = data$cost,
type = "p"
)
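Another base R option, sketched here as an assumption rather than something from the answers above, is barplot with names.arg, which takes the numeric costs as bar heights and the product names as labels. The axis titles are illustrative.

```r
# Same data as in the question
df <- data.frame(
  product = c("Milk", "egg", "coffee", "tea", "sugar"),
  cost = c(3, 2, 4, 2, 3)
)

# Bar heights come from cost, labels from product;
# barplot returns the x position of each bar's midpoint
mids <- barplot(df$cost,
                names.arg = as.character(df$product),
                xlab = "Product", ylab = "Cost")
```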

Putting x-axis labels directly under tick marks in barplots in R

I have a table (below) showing the percentage of tree species (categorical variable) present in a group experiment. My objective is to plot the percentage of tree species on the y-axis and 'Species' on the x-axis within a barplot.
Issue
I am having trouble formatting the x-axis correctly. The x-axis labels for 'Species' should:
1) be positioned directly underneath their bar, at the tick mark
2) not overlap the plotting area
If anyone can help solve this issue, I would be incredibly grateful.
R code
df <- leaf.percent[order(leaf.percent$Leaf.Percentge, decreasing = TRUE),]
Tree.labels<-c("Quercus robur", "Quercus petraea",
"Deciduous", "Oak",
"Plant", "Shrub")
par(mar=c(6, 6, 3, 3))
Tree<-barplot(df$Leaf.Percentge, names.arg = df$Species,
xaxt = "n",
ylab="Percentage %",
xlab="Tree Species",
col="lightblue",
ylim = c(0, 60))
axis(1, at=Tree, labels=FALSE)
text(seq(1, 6, by=1), par("usr")[3] - 0.2,
labels=unique(Tree.labels),
srt = 25, pos = 1,
xpd = TRUE, cex=0.7)
DATA
structure(list(Species = structure(1:6, .Label = c("Deciduous",
"Oak", "Plant", "Quercus_petraea", "Quercus_robur", "Shrub"), class = "factor"),
Frequency = c(48L, 29L, 6L, 70L, 206L, 4L), Leaf.Percentge = c(13.2231404958678,
7.98898071625344, 1.65289256198347, 19.2837465564738, 56.7493112947658,
1.10192837465565)), .Names = c("Species", "Frequency", "Leaf.Percentge"
), row.names = c(NA, -6L), class = "data.frame")
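One possible fix, offered as a sketch and not as a confirmed answer from the thread: barplot returns the x coordinate of each bar's midpoint, and those midpoints are not 1, 2, ..., 6. Placing the labels at the returned midpoints (instead of seq(1, 6, by = 1)) and right-aligning the rotated text with adj = 1 keeps each label under its tick. Taking the labels from the sorted data frame also keeps them in the same order as the bars. The rounded percentages, the vertical offset of 2 and the margin sizes are assumptions for illustration.

```r
# Data from the question, with percentages rounded for brevity
leaf.percent <- data.frame(
  Species = c("Deciduous", "Oak", "Plant",
              "Quercus_petraea", "Quercus_robur", "Shrub"),
  Leaf.Percentge = c(13.22, 7.99, 1.65, 19.28, 56.75, 1.10)
)
df <- leaf.percent[order(leaf.percent$Leaf.Percentge, decreasing = TRUE), ]

par(mar = c(7, 6, 3, 3))  # extra bottom margin for the rotated labels

# Without names.arg, barplot draws no x-axis labels of its own
mids <- barplot(df$Leaf.Percentge,
                ylab = "Percentage %",
                col = "lightblue",
                ylim = c(0, 60))

# Tick marks at the bar midpoints, then rotated labels anchored at the ticks
axis(1, at = mids, labels = FALSE)
text(x = mids, y = par("usr")[3] - 2,
     labels = gsub("_", " ", df$Species),
     srt = 25, adj = 1, xpd = TRUE, cex = 0.7)
title(xlab = "Tree Species", line = 5)
```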

Create new column with percentages in data frame

I have the following dataframe:
dput(df1)
structure(list(month = c(1, 1, 2, 2, 3, 4), transaction_type = c("AAA",
"BBB", "BBB", "CCC",
"DDD", "AAA"), max_wt_per_month = c(54.9,
51.6833333333333, 52.3333333333333, 49.4666666666667, 49.85,
48.5833333333333), min_wt_per_month = c(0, 0, 0, 0, 0, 0), avg_wt_per_month = c(8.41701333107861,
7.65211141060198, 6.44184012508551, 7.74798927613941, 7.4360566888844,
7.50611319574734), prop = c(Inf, Inf, Inf, Inf, Inf, Inf)), .Names = c("month",
"transaction_type", "max_wt_per_month", "min_wt_per_month", "avg_wt_per_month",
"prop"), row.names = c(NA, -6L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = list(month), drop = TRUE, indices = list(
0:5), group_sizes = 6L, biggest_group_size = 6L, labels = structure(list(
month = 1), row.names = c(NA, -1L), class = "data.frame", vars = list(
month), drop = TRUE, .Names = "month"))
I want to create a column prop containing each row's maximum waiting time as a percentage of the total for its month. When I run this code, I get Inf values in most of the rows (this is especially evident in the real dataset):
my_fun=function(vec){
100*as.numeric(vec[3]) /
sum(with(data_merged_transactions, ifelse(month == vec[1], max_wt_per_month, 0))) }
data_merged_transactions$prop=apply(data_merged_transactions , 1 , my_fun)
I then finally need to create the filled area chart so that each area would be a percentage out of 100%:
ggplot(data_merged_transactions, aes(x=month, y=prop, fill=transaction_type)) +
geom_area(alpha=0.6 , size=1, colour="black")
Why do I get Inf if the sum is not equal to 0?
Moreover, is it possible to create a filled area chart with months as factors (Jan, Feb, etc.) instead of numbers? I tried to replace the month IDs with month names, but then I got very thin bars instead of a filled area.
Is this what you were looking for?
library(tidyverse)
df1_tidy <- df1 %>%
group_by(month) %>%
summarise(SUM = sum(max_wt_per_month)) %>%
full_join(df1) %>%
mutate(prop = max_wt_per_month / SUM)
ggplot(data = df1_tidy,
aes(x = month,
y = prop,
fill = transaction_type)) +
geom_area(alpha = 0.6,
size = 1,
colour = "black") +
scale_x_continuous(breaks = 1:4, labels = c("Jan", "Feb", "Mar", "Apr"))
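As for why the original code returns Inf, one likely cause (an assumption, since the full data set is not shown) is that apply first converts the data frame to a character matrix. Numeric columns are passed through format, which pads them to a common width, so month 1 becomes " 1" once a two-digit month is present. The comparison month == vec[1] then matches nothing, the sum is 0, and the division yields Inf. The sketch below demonstrates the padding and computes percentages in base R with ave, which avoids apply altogether. The rounded values are for illustration only.

```r
# The padding effect: numbers become fixed-width strings in the matrix
m <- as.matrix(data.frame(month = c(1, 10), type = c("a", "b")))
m[, "month"]

# Simplified version of the data from the question
df1 <- data.frame(
  month = c(1, 1, 2, 2, 3, 4),
  transaction_type = c("AAA", "BBB", "BBB", "CCC", "DDD", "AAA"),
  max_wt_per_month = c(54.9, 51.68, 52.33, 49.47, 49.85, 48.58)
)

# ave() returns the per-month total aligned with each row,
# so the division is vectorised and stays numeric
month_total <- ave(df1$max_wt_per_month, df1$month, FUN = sum)
df1$prop <- 100 * df1$max_wt_per_month / month_total
df1
```

Each month's prop values then sum to 100, which is what a filled area chart of percentages needs.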

How to change the color of dendrogram for each group in a cluster

Here is my data
df<- structure(list(name = structure(c(2L, 12L, 1L, 16L, 14L, 10L,
9L, 5L, 15L, 4L, 8L, 13L, 7L, 6L, 3L, 11L), .Label = c("All",
"Bab", "boro", "bra", "charli", "delta", "few", "hora", "Howe",
"ist", "kind", "Kiss", "myr", "No", "TT", "where"), class = "factor"),
value = c(1.251, -1.018, -1.074, -1.137, 1.018, 1.293, 1.022,
-1.008, 1.022, 1.252, -1.005, 1.694, -1.068, 1.396, 1.646,
1.016)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-16L))
Here is what I do:
d <- dist(as.matrix(df$value),method = "euclidean")
#compute cluster membership
hcn <- hclust(d,method = "ward.D2")
plot(hcn)
and it gives me roughly what I want, as follows:
All groups are drawn in black, so the dendrogram is not very clear. I want to change the color of each group and to show the names (vertically) instead of the row numbers. Finally, I want to be able to remove the hclust (*, "ward.D2") subtitle and to set the x and y labels myself.
You could use the dendextend package, which is designed for tasks such as this:
# install the package:
if (!require('dendextend')) install.packages('dendextend'); library('dendextend')
## Example:
dend <- as.dendrogram(hclust(dist(USArrests), "ave"))
d1=color_branches(dend,k=5, col = c(3,1,1,4,1))
plot(d1) # selective coloring of branches :)
d2=color_branches(d1,k=5) # auto-coloring 5 clusters of branches.
plot(d2)
# More examples are in ?color_branches
You can see many examples in the presentations and vignettes of the package, in the "usage" section in the following URL: https://github.com/talgalili/dendextend
Alternatively, you can use dendrapply. For instance:
# Generate data
set.seed(12345)
desc.1 <- c(rnorm(10, 0, 1), rnorm(20, 10, 4))
desc.2 <- c(rnorm(5, 20, .5), rnorm(5, 5, 1.5), rnorm(20, 10, 2))
desc.3 <- c(rnorm(10, 3, .1), rnorm(15, 6, .2), rnorm(5, 5, .3))
data <- cbind(desc.1, desc.2, desc.3)
# Create dendrogram
d <- dist(data)
hc <- as.dendrogram(hclust(d))
# Function to color branches
colbranches <- function(n, col)
{
a <- attributes(n) # Find the attributes of current node
# Color edges with requested color
attr(n, "edgePar") <- c(a$edgePar, list(col=col, lwd=2))
n # Don't forget to return the node!
}
# Color the first sub-branch of the first branch in red,
# the second sub-branch in orange and the second branch in blue
hc[[1]][[1]] = dendrapply(hc[[1]][[1]], colbranches, "red")
hc[[1]][[2]] = dendrapply(hc[[1]][[2]], colbranches, "orange")
hc[[2]] = dendrapply(hc[[2]], colbranches, "blue")
# Plot
plot(hc)
I got this information from: How to create a dendrogram with colored branches?
We could instead draw rectangles around the groups. Let's say there are 5 groups (k = 5):
# plot dendrogram
plot(hcn)
# then draw red borders around the 5 clusters
rect.hclust(hcn, k = 5, border = "red")
EDIT:
Remove x axis label, and add names instead of numbers:
plot(hcn, xlab = NA, sub = NA, labels = df$name)
rect.hclust(hcn, k = 5, border = "red")
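To color the groups themselves on the question's own data without extra packages, one sketch is to combine cutree with dendrapply. The choice of k = 2, the two colors, the axis titles and the helper name color_leaf are all assumptions for illustration. The names are attached to the values before clustering so that the leaves carry them as labels.

```r
# Data from the question, with names as plain strings
df <- data.frame(
  name = c("Bab", "Kiss", "All", "where", "No", "ist", "Howe", "charli",
           "TT", "bra", "hora", "myr", "few", "delta", "boro", "kind"),
  value = c(1.251, -1.018, -1.074, -1.137, 1.018, 1.293, 1.022, -1.008,
            1.022, 1.252, -1.005, 1.694, -1.068, 1.396, 1.646, 1.016)
)

# Named values, so the names end up as leaf labels
vals <- setNames(df$value, as.character(df$name))
hcn <- hclust(dist(vals), method = "ward.D2")

# Cluster membership per name (k = 2 is an assumed number of groups)
groups <- cutree(hcn, k = 2)
cols <- c("red", "blue")

# Hypothetical helper: color each leaf label and its edge by cluster
color_leaf <- function(node) {
  if (is.leaf(node)) {
    grp <- groups[[attr(node, "label")]]
    attr(node, "nodePar") <- list(lab.col = cols[grp], pch = NA)
    attr(node, "edgePar") <- list(col = cols[grp])
  }
  node
}

dend <- dendrapply(as.dendrogram(hcn), color_leaf)

# Custom title and axis labels; no hclust subtitle is drawn for a dendrogram
plot(dend, main = "Clusters by value", xlab = "Name", ylab = "Height")
```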
