Create ggplot2 density plot from binned data? - r

I would like to visualize a distribution using ggplot2 tools like density plots and ECDFs. The challenge is that I only have binned data available and not the individual samples. That is, each row in my data frame has columns bin,count rather than individual samples. However, the bins can be quite narrow e.g. with data spread over 500 bins.
Are there some reasonable solutions? My first thought was to somehow expand each bin into individual samples by repeating the upper bound of the bin count times. I am not sure the best way to do that nor whether it is especially inadvisable.
Tips welcome!
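A minimal sketch of one way to approach this without expanding the bins into fake samples: pass the counts as weights to the density estimate, and build the ECDF from the cumulative counts. The `binned` data frame, its column names (`lower`, `upper`, `mid`, `count`, `weight`, `ecdf`) and the simulated data are made up for illustration, and the bandwidth of 0.2 is an arbitrary choice.

```r
# Hypothetical binned data: 500 narrow bins with a count per bin.
set.seed(1)
breaks <- seq(-5, 5, length.out = 501)
counts <- as.vector(table(cut(rnorm(10000), breaks)))
binned <- data.frame(
  lower = head(breaks, -1),
  upper = tail(breaks, -1),
  count = counts
)
binned$mid <- (binned$lower + binned$upper) / 2

# Weighted kernel density: each bin midpoint is weighted by its share of samples.
binned$weight <- binned$count / sum(binned$count)
dens <- density(binned$mid, weights = binned$weight, bw = 0.2)

# ECDF: the cumulative share is exact at each bin's upper bound.
binned$ecdf <- cumsum(binned$count) / sum(binned$count)

# The plots are only built if ggplot2 is installed; print(p_density) draws one.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_density <- ggplot(binned, aes(x = mid, weight = weight)) + geom_density()
  p_ecdf <- ggplot(binned, aes(x = upper, y = ecdf)) + geom_step()
}
```

With bins this narrow, weighting the midpoints should give nearly the same picture as repeating a bin value `count` times, while keeping the data frame at one row per bin.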

Related

is there a function in R to quickly calculate the difference between two geom_bin2d maps?

I have a large 2-variable dataset that may be classified into 2 groups using a third variable. Overplotting is an issue, so I've resorted to visualizing my data using bin2d and other similar approaches. I would like to calculate the difference between the binned counts of the two groups and visualize that as well (e.g subtract one 2d histogram from another).
example code:
library(dplyr)
library(ggplot2)
df <- diamonds
df_color_H <- filter(df, color == "H")
df_color_E <- filter(df, color == "E")
ggplot(df_color_H) +
  geom_bin2d(aes(carat, price), bins = 40)
ggplot(df_color_E) +
  geom_bin2d(aes(carat, price), bins = 40)
Ultimately, I want to visualize the difference between overlapping bins. I know the solution is likely a pre-processing step before bringing the data into ggplot2, but I haven't found exactly what I'm looking for. I also don't need a sophisticated solution using KDEs or something like that.
Any suggestions would be welcome!
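One possible pre-processing step, sketched under assumptions: bin both groups on the same grid, count per cell, and subtract. The helper `diff_bin2d()` and its argument names are hypothetical and not part of ggplot2. Because the grid spans both groups together, the cells will not line up exactly with the two separate `geom_bin2d()` plots above.

```r
# Hypothetical helper: per-cell count difference (first level minus second)
# on a grid shared by both groups.
diff_bin2d <- function(x, y, group, levels, bins = 40) {
  x_breaks <- seq(min(x), max(x), length.out = bins + 1)
  y_breaks <- seq(min(y), max(y), length.out = bins + 1)
  x_bin <- cut(x, x_breaks, include.lowest = TRUE, labels = FALSE)
  y_bin <- cut(y, y_breaks, include.lowest = TRUE, labels = FALSE)

  # Count one group's points per cell, keeping empty cells as zeros.
  count_group <- function(g) {
    keep <- group == g
    table(factor(x_bin[keep], levels = seq_len(bins)),
          factor(y_bin[keep], levels = seq_len(bins)))
  }
  diff_counts <- count_group(levels[1]) - count_group(levels[2])

  # One row per cell; expand.grid and as.vector both vary x_bin fastest.
  out <- expand.grid(x_bin = seq_len(bins), y_bin = seq_len(bins))
  x_mid <- (head(x_breaks, -1) + tail(x_breaks, -1)) / 2
  y_mid <- (head(y_breaks, -1) + tail(y_breaks, -1)) / 2
  out$x <- x_mid[out$x_bin]
  out$y <- y_mid[out$y_bin]
  out$diff <- as.vector(diff_counts)
  out
}

# Built only if ggplot2 (which provides `diamonds`) is installed;
# print(p_diff) draws the plot.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  cells <- diff_bin2d(diamonds$carat, diamonds$price, diamonds$color, c("H", "E"))
  p_diff <- ggplot(cells, aes(x, y, fill = diff)) +
    geom_tile() +
    scale_fill_gradient2() +
    labs(x = "carat", y = "price", fill = "H minus E")
}
```

The diverging fill scale puts zero in the middle, so cells where one color dominates stand out in opposite hues.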

R survfit() - separate plots for each level of IV

I'm very new to R and to survival analyses. I'm trying to plot survival curves for each level of a categorical IV. Importantly, I want them plotted on separate plots. Is there an easy way to do this? An obvious solution is to just create separate data frames for the categories, however, that seems rather cumbersome.
my code:
figure_site <- survfit(survObject~site, data = uis_data)
plot(figure_site)
site has 2 levels (0, 1) and I want two plots - one for site==0 and site==1.
Thanks!
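A minimal sketch of one way to avoid splitting the data frame: a `survfit` object with strata can be subscripted, so each stratum can be plotted on its own panel. Since `uis_data` and `survObject` are not shown, the `lung` data set and its `sex` variable from the survival package stand in for them here, and `plot_each_stratum()` is a hypothetical helper name.

```r
library(survival)

# `lung` and `sex` are stand-ins for uis_data and site.
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Draw one panel per stratum by subscripting the fitted curves.
plot_each_stratum <- function(fit) {
  strata_names <- names(fit$strata)
  old_par <- par(mfrow = c(1, length(strata_names)))
  on.exit(par(old_par))
  for (i in seq_along(strata_names)) {
    plot(fit[i], main = strata_names[i],
         xlab = "Time", ylab = "Survival probability")
  }
  invisible(strata_names)
}

if (interactive()) plot_each_stratum(fit)
```

For fully separate figures rather than side-by-side panels, the same loop could be used without the `par(mfrow = ...)` call.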

is there a way to preserve the clustering in a heatmap but reduce the number of observations?

I have a data set with 90 observations (rows) across 20 columns. I have generated a pretty neat heatmap with the pheatmap package, which clusters my data into two groups. Although it is not entirely clean, the two clusters of the dendrogram pretty much separate my samples into 2 distinct groups matching my conditions. Now I want to reduce this set of 90 to a stricter set of around 20-30 observations while preserving the same clustering order as shown in pheatmap. Is there a way to do that, or is there another package that reduces my observations to a minimum set which still preserves my clustering order as seen now? The code for pheatmap is
pheatmap(mydata[rownames(df.90),],scale="row",clustering_distance_cols = "correlation",show_rownames= T,show_colnames=T,color=col,annotation=batch.annotation,cluster_col=T,fontsize_row = 8,fontsize_col = 8,clustering_method = "ward.D2",border_color = NA)
Is there a package in R that I am missing which can handle this, or something in pheatmap itself that I can use to reduce the variables and run a kind of permutation test to find the minimum set of observations that still retains my clustering?
The data is genes in rows and expression in columns across patients.
I would like to answer my own question and want feedback. I used kmeans_k=30 in pheatmap and obtained 29 clusters that still preserve the clustering of the 90 observations that I made previously. From there I obtained the genes in their respective clusters. I selected the top 5 clusters on either side of that heatmap, since they are the ones with high SD and can still produce my required heatmap. Since I used scale="row" throughout and kept both the row and column dendrograms on, I did not want to change them now. When I plot these 31 genes (observations), they in fact improve my clustering even more and partition the samples into 2 groups in a cleaner way, as I wanted. Code for kmeans and the new heatmap:
with kmeans 30
obj<-pheatmap(df.90,scale="row",clustering_distance_cols = "correlation",show_rownames= T,show_colnames=T,color=col,annotation=batch.annotation,cluster_col=T,fontsize_row = 6,fontsize_col = 7,clustering_method = "ward.D2",border_color = NA,cellwidth = NA,cellheight = NA,kmeans_k = 30)
retrieve the clusters and extract the observations/genes
obj$kmeans$cluster
obtaining the top clusters and plot them with the heatmap
pheatmap(mydata[rownames(df.31),],scale="row",clustering_distance_cols = "correlation",show_rownames= T,show_colnames=T,color=col,annotation=batch.annotation,cluster_col=T,fontsize_row = 8,fontsize_col = 8,clustering_method = "ward.D2",border_color = NA)
What do you think of this approach? It is not the one I originally intended, but I don't think it is wrong either. I would like feedback if someone can suggest a better method or approach, or if they think this one is not correct. Thanks
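One way to sanity-check an approach like this, sketched in base R under assumptions: cluster the samples on the full gene set and on the reduced set, cut both trees into two groups, and compare the assignments. The simulated `expr` matrix and the helper `cluster_samples()` are hypothetical. The helper only mirrors the settings used above (row scaling, correlation distance between columns, `ward.D2`) and is not pheatmap's own code.

```r
# Simulated stand-in for the real data: 90 genes (rows) x 20 samples (columns),
# with the first 30 genes shifted upwards in the last 10 samples.
set.seed(42)
n_genes <- 90
n_samples <- 20
expr <- matrix(
  rnorm(n_genes * n_samples), nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("s", seq_len(n_samples)))
)
expr[1:30, 11:20] <- expr[1:30, 11:20] + 4

# Cluster the columns roughly the way the pheatmap call above is configured.
cluster_samples <- function(m, k = 2) {
  scaled <- t(scale(t(m)))        # scale each row, as with scale = "row"
  d <- as.dist(1 - cor(scaled))   # correlation distance between columns
  cutree(hclust(d, method = "ward.D2"), k = k)
}

full_groups <- cluster_samples(expr)

# Reduced set: the 30 genes with the highest standard deviation.
top_genes <- order(apply(expr, 1, sd), decreasing = TRUE)[1:30]
reduced_groups <- cluster_samples(expr[top_genes, ])

# Share of samples assigned to the same group by both clusterings.
agreement <- mean(full_groups == reduced_groups)
table(full = full_groups, reduced = reduced_groups)
```

If the cross-table is diagonal, the reduced gene set reproduces the two-group split of the samples. Repeating the comparison for different subset sizes would give a rough idea of how small the set can get before the split breaks down.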

Adding multiple lines to plot, without ggplot

I would like to plot multiple lines on the same plot, without using ggplot.
I have scores for different individuals across a set time period and wish to plot a line between yearly scores for each individual. Data is organised with each row representing an individual and each column an observed value in a given year.
Currently I am using a for loop, but am aware that this is often not efficient in R, and am interested if there are any more suitable approaches available within base R.
I will be working with up to 100,000 individuals.
Thanks.
Code:
df=data.frame(runif(10,0,100),runif(10,0,100),runif(10,0,100),runif(10,0,100))
df=data.frame(t(df))
Years=seq(1,10,1)
plot(1,type="n",xlab="Year",ylab="Score", xlim=c(1,10), ylim=c(0,100))
for(x in 1:4){lines(Years,df[x,])}
Efficiency is not much of a consideration when plotting since plotting to a device is a slow operation in itself. You can use matplot (which uses a loop internally). It's basically a more sophisticated version of your code wrapped in a function.
matplot(Years, t(df), xlab="Year", ylab="Score", type = "l")
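If the loop itself is a concern with very many individuals, another base R option is a single `lines()` call: `lines()` breaks the line wherever it meets an `NA`, so all series can be concatenated with `NA` separators. This is a sketch with made-up data; `scores` and `years` are hypothetical names, with one row per individual as in the question.

```r
# Made-up scores: 4 individuals (rows) observed over 10 years (columns).
set.seed(1)
scores <- matrix(runif(40, 0, 100), nrow = 4)
years <- seq_len(ncol(scores))

# Append an NA after each individual's series, then flatten row by row.
x <- rep(c(years, NA), times = nrow(scores))
y <- as.vector(t(cbind(scores, NA)))

if (interactive()) {
  plot(1, type = "n", xlab = "Year", ylab = "Score",
       xlim = range(years), ylim = c(0, 100))
  lines(x, y)  # one call draws every individual's line
}
```

All lines share one color and line type this way, which is the trade-off compared with `matplot`.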

How to structure data for R?

So... newbie R user here. I have some observations that I'd like to record using R and be able to add to later.
The items are sorted by weights, and the number at each weight recorded. So far what I have looks like this:
weights <- c(rep(171.5, times=1), rep(171.6, times=2), rep(171.7, times=4), rep(171.8, times=18), rep(171.9, times=39), rep(172.0, times=36), rep(172.1, times=34), rep(172.2, times=25))
There will be a total of 500 items being observed.
I'm going to be taking additional observations over time to (hopefully) see how the distribution of weights changes with use/wear. I'd like to be able to make plots showing either stacked histograms or boxplots.
What would be the best way to format / store this data to facilitate this kind of use case? A matrix, dataframe, something else?
As other comments have suggested, the most versatile (and perhaps most useful) container for your data would be a data frame, which works well with ggplot2 for your future plotting needs, such as box plots and various histograms.
Toy example
All the code below does is take your weights vector from above, build a data frame with some dummy IDs, and draw a box-and-whisker plot.
library(ggplot2)
IDs<-sample(LETTERS[1:5],length(weights),TRUE) #dummy ID values
df<-data.frame(ID=IDs,Weights=weights) #make data frame with your original `weights` vector
ggplot(data=df,aes(factor(ID),Weights))+geom_boxplot() #box-plot
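Since the observations are recorded as tallies and will be repeated over time, one possible layout is a long data frame with one row per session and weight, expanded to one row per item only when plotting. This is a sketch: the column names (`session`, `weight`, `count`) are assumptions, session 1 uses the counts from the question, and the session 2 counts are invented for illustration.

```r
weight_values <- c(171.5, 171.6, 171.7, 171.8, 171.9, 172.0, 172.1, 172.2)

# Session 1: the tallies from the question. Session 2: invented counts.
tallies <- rbind(
  data.frame(session = 1, weight = weight_values,
             count = c(1, 2, 4, 18, 39, 36, 34, 25)),
  data.frame(session = 2, weight = weight_values,
             count = c(3, 5, 9, 25, 40, 35, 27, 15))
)

# Expand to one row per item: repeat each row `count` times.
items <- tallies[rep(seq_len(nrow(tallies)), tallies$count),
                 c("session", "weight")]
rownames(items) <- NULL

# The plots are only built if ggplot2 is installed; print(p_box) draws one.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_box <- ggplot(items, aes(factor(session), weight)) + geom_boxplot()
  p_hist <- ggplot(items, aes(weight, fill = factor(session))) +
    geom_histogram(binwidth = 0.1, position = "stack")
}
```

Adding a later session is then a matter of binding new rows onto `tallies`, and the compact table can be saved with `write.csv()` between sessions.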
