Clustering with only two variables? - r

I want to cluster my two-dimensional dataset, but I couldn't figure out how. My dataset looks like this:
dt<-data.frame(x=c(rnorm(10, 2,1), rnorm(10, 6,1)), categorize=c(rep(1,10), rep(2,10)))
I just want to plot this dataset like the graph below. If I add a third variable such as c(1:nrow(dt)), would that work, or what would you recommend?
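One way to try the idea from the question, as a minimal sketch: use the row index purely as a plotting axis and cluster on x alone. The `idx` and `cluster` columns, the seed, and the choice of `kmeans` are assumptions added for illustration; they are not part of the original post.

```r
set.seed(1)  # make the simulated data reproducible
dt <- data.frame(x = c(rnorm(10, 2, 1), rnorm(10, 6, 1)),
                 categorize = c(rep(1, 10), rep(2, 10)))

# Hypothetical helper column: the row index, used only as a plotting axis
dt$idx <- seq_len(nrow(dt))

# Cluster on x alone; the index carries no information about the groups
km <- kmeans(dt$x, centers = 2, nstart = 10)
dt$cluster <- km$cluster

# Colour shows the known group, plotting symbol shows the k-means cluster
plot(dt$idx, dt$x, col = dt$categorize, pch = c(16, 17)[dt$cluster],
     xlab = "row index", ylab = "x")
```

Keeping the index out of the clustering step avoids grouping points by their position in the data frame rather than by their values.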

Related

Is there a function in R to quickly calculate the difference between two geom_bin2d maps?

I have a large 2-variable dataset that may be classified into 2 groups using a third variable. Overplotting is an issue, so I've resorted to visualizing my data using bin2d and other similar approaches. I would like to calculate the difference between the binned counts of the two groups and visualize that as well (e.g., subtract one 2d histogram from another).
Example code:
library(ggplot2) # geom_bin2d() and the diamonds data
library(dplyr) # filter()
df <- diamonds
df_color_H <- filter(df, color == "H")
df_color_E <- filter(df, color == "E")
ggplot(df_color_H) +
  geom_bin2d(aes(carat, price), bins = 40)
ggplot(df_color_E) +
  geom_bin2d(aes(carat, price), bins = 40)
Ultimately, I want to visualize the difference between overlapping bins. I know the solution is likely a pre-processing step before bringing the data into ggplot2, but I haven't found exactly what I'm looking for. I also don't need a sophisticated solution using KDEs or something like that.
Any suggestions would be welcome!
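A minimal sketch of such a pre-processing step, assuming both groups are counted on one shared 40 x 40 grid built with `cut()` and `table()`. The helper names `bin_counts` and `mid`, and the choice of `geom_tile()` with a diverging fill scale, are illustrative assumptions rather than an established recipe.

```r
library(ggplot2)  # provides the diamonds data and the plotting functions

df <- diamonds

# Shared bin edges so that both groups are counted on the same grid
x_breaks <- seq(min(df$carat), max(df$carat), length.out = 41)  # 40 bins
y_breaks <- seq(min(df$price), max(df$price), length.out = 41)  # 40 bins

# Count observations per (carat bin, price bin) cell for one subset
bin_counts <- function(d) {
  table(cut(d$carat, x_breaks, include.lowest = TRUE),
        cut(d$price, y_breaks, include.lowest = TRUE))
}

# Cell-wise difference: positive where colour H has more observations than E
count_diff <- bin_counts(df[df$color == "H", ]) - bin_counts(df[df$color == "E", ])

# Long format for plotting: one row per cell, bin midpoints as coordinates
mid <- function(b) (head(b, -1) + tail(b, -1)) / 2
diff_df <- expand.grid(carat = mid(x_breaks), price = mid(y_breaks))
diff_df$diff <- as.vector(count_diff)  # column-major, same order as expand.grid

# Plot only cells where the two groups differ
ggplot(diff_df[diff_df$diff != 0, ], aes(carat, price, fill = diff)) +
  geom_tile() +
  scale_fill_gradient2()  # diverging scale centred on zero
```

Because the two groups have different sizes, it may be worth dividing each table by its total before subtracting, so that the plot compares proportions rather than raw counts.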

ComplexHeatmap zoom annotation: cluster-wise boxplot

I want to show a cluster-wise boxplot distribution with ComplexHeatmap. I was able to produce a row-wise distribution, but how do I implement the cluster-wise distribution shown in the attached example?
The dummy example creates subgroups, which it then shows in the distribution. Similarly, I have already defined clusters in my data file; they are given in the first column.
How do I implement this for my data frame using the example code? I'm not sure how to define the subgroups in my case.
Any suggestion or help would be really appreciated.
This is the output I would like to see:
This is the output I have:
The dataset is this one: small_data
And my code:
library(ComplexHeatmap) # provides Heatmap()
df <- read.csv("small_data.txt", header = TRUE)
heat <- t(scale(t(df[, 3:ncol(df)]))) # z-score each row
myBreaks <- seq(-1.5, 1.5, length.out = 100) # defined but not used below
hmap <- Heatmap(heat)
hmap
How do I implement the cluster-specific distribution, as shown in the first picture? The second figure is what I'm getting now.

How to remove outliers from a dataset using bivariate boxplot

I have a dataset (see below) made up of multiple variables; two of these, 'manu' and 'popul', contain numeric values.
From this data I plotted a bivariate boxplot of 'manu' and 'popul' so that I could find outliers between these two variables. This is what it looks like:
From this plot I can see a few outliers. I was able to identify which values are outliers using the code below, which also shows the values:
What I would like to know is: how do I now take the rows that contain these values and remove them from the dataset?
Thanks in advance.
You can extract the desired subset like so:

R: How can I plot the average curve of a set of curves using ggplot2 in R

I'm relatively new to the R language and I'm trying to plot the average of a set of curves. For example, in the picture below I have 3 curves and I need to plot the average of the 3 curves. What approaches can I take to solve this?
Graph
My data is structured this way:
All sensors are in a different data frame
The structure of the data frame
Any help is welcome, and feedback on my question is welcome too, as I'm new to Stack Overflow.
Thanks
With your data in dataframe df, with time down rows and Leyenda across columns:
df$mean <- rowMeans(df[,1:3])
But please do provide an example of your data in future.
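As a hedged illustration of the `rowMeans` approach, here is a sketch on toy data. The column names `s1`, `s2`, `s3`, the `time` grid, and the sine-shaped curves are assumptions, since the original data was not posted; it also assumes all sensors were sampled at the same time points and combined into one data frame.

```r
library(ggplot2)

# Toy stand-in for the real data: three sensor curves on a common time grid
time <- seq(0, 10, by = 0.5)
df <- data.frame(time = time,
                 s1 = sin(time),
                 s2 = sin(time) + 0.2,
                 s3 = sin(time) - 0.2)

# Point-wise average of the three curves
df$mean <- rowMeans(df[, c("s1", "s2", "s3")])

# Individual curves in grey, the average curve on top in red
ggplot(df, aes(time)) +
  geom_line(aes(y = s1), colour = "grey60") +
  geom_line(aes(y = s2), colour = "grey60") +
  geom_line(aes(y = s3), colour = "grey60") +
  geom_line(aes(y = mean), colour = "red", linewidth = 1)
```

If the sensors were sampled at different times, the curves would first need to be interpolated onto a common grid (for example with `approx()`) before averaging.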

How to structure data for R?

So... newbie R user here. I have some observations that I'd like to record using R and be able to add to later.
The items are sorted by weights, and the number at each weight recorded. So far what I have looks like this:
weights <- c(rep(171.5, times=1), rep(171.6, times=2), rep(171.7, times=4), rep(171.8, times=18), rep(171.9, times=39), rep(172.0, times=36), rep(172.1, times=34), rep(172.2, times=25))
There will be a total of 500 items being observed.
I'm going to be taking additional observations over time to (hopefully) see how the distribution of weights changes with use/wear. I'd like to be able to make plots showing either stacked histograms or boxplots.
What would be the best way to format / store this data to facilitate this kind of use case? A matrix, dataframe, something else?
As other comments have suggested, the most versatile (and perhaps most useful) structure for your data would be a data frame, which works well with ggplot2 for your future plotting needs (such as box plots and various histograms).
Toy example
The code below uses your weights vector to create a data frame with some dummy IDs and draws a box-and-whisker plot, resulting in the plot below.
library(ggplot2)
IDs <- sample(LETTERS[1:5], length(weights), TRUE) # dummy ID values
df <- data.frame(ID = IDs, Weights = weights) # data frame built from your original `weights` vector
ggplot(data = df, aes(factor(ID), Weights)) + geom_boxplot() # box plot
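For observations added over time, one possible layout is a long-format data frame with one row per session and weight, storing the count instead of repeating each value. This is only a sketch: the `session` labels and every count after the first session are made-up placeholders, not real observations.

```r
library(ggplot2)

# One row per (session, weight) pair with the number of items observed.
# First-session counts come from the question; the rest are placeholders.
obs <- data.frame(
  session = rep(c("first", "second"), each = 3),
  weight  = rep(c(171.8, 171.9, 172.0), times = 2),
  count   = c(18, 39, 36, 20, 41, 30)
)

# A later session can be appended with rbind()
obs <- rbind(obs, data.frame(session = "third", weight = 171.9, count = 5))

# Expand counts to one row per item when a plot needs the raw values
items <- obs[rep(seq_len(nrow(obs)), times = obs$count), c("session", "weight")]

ggplot(items, aes(session, weight)) + geom_boxplot()
```

Storing counts keeps the recorded data compact and easy to extend, while the expansion step recovers the one-row-per-item form that boxplots and histograms expect.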
