I have a tab-delimited text file with 3 columns, A, B, and C:
A B C
0.07142857142857142 0.35714285714285715 0.21428571428571427
0.0 0.3333333333333333 0.3888888888888889
0.07142857142857142 0.35714285714285715 0.21428571428571427
0.0 0.3333333333333333 0.3888888888888889
Each row represents a sample with 3 different percentages A, B and C. In total I have 4 files for 4 different organisms.
There can be more than a million rows per file.
My idea is to plot each row to see the distribution of the (A, B, C) triplets in a given file, determine the most frequent triplet in that file, and then compare the 4 files.
I tried plotting these points in R for each file (several curves on the same graph, with A, B, and C on the y axis and the sample number on the x axis), but there are so many points that the graph can't be interpreted. Also, for the million-row file, R crashes and won't plot the points.
What would be the best approach to represent these points? Also, is the mode function enough to determine the most frequent (A, B, C) triplet, or is there an appropriate statistical test I could use?
Any help would be much appreciated.
Thanks.
As I mentioned in my comment, clustering may be a solution to your problem. Here is one way of clustering using kmeans:
irisCl <- transform(iris, Cluster = kmeans(iris[1:4],3)$cluster)
library(ggplot2)
qplot(Sepal.Length, Sepal.Width, data=irisCl, colour=Species) + facet_grid(~Cluster)
Note that we have clustered in a 4-dimensional variable space. In the run shown here, the setosa are identified correctly in the first cluster, the second cluster contains only virginica, but the third cluster contains a mixture of versicolor and virginica. The cluster numbers can differ between runs because kmeans starts from randomly chosen centers.
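To check the cluster composition described above without reading it off the plot, the cluster labels can be cross-tabulated against the species. This is only a sketch: the seed and the nstart value are assumptions added to make the run repeatable and are not part of the original answer.

```r
set.seed(42)  # assumed seed, only so the cluster numbering is repeatable
# nstart = 25 (an assumption) reruns kmeans from several random starts
km <- kmeans(iris[1:4], centers = 3, nstart = 25)

# Rows are the true species, columns are the kmeans cluster labels
tab <- table(Species = iris$Species, Cluster = km$cluster)
print(tab)
```

A species that falls entirely in one column has been separated cleanly; a species spread over two columns has been mixed with another.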
I am using a multibeam echosounder to create a raster stack in R with layers all in the same resolution, which I then convert to a data frame so I can create additive models to describe the distribution of fish around bathymetry variables (depth, aspect, slope, roughness etc.).
The issue I have is that I would like to keep my response variable (fish school volume) fine and my predictive variables (bathymetry) coarse, such that I have, say, 1 x 1 m cells representing the distribution of fish schools and 10 x 10 m cells representing bathymetry (so the coarse cell is divisible by the fine cell with no remainder).
I can easily create these rasters individually but relating them is the problem. As each coarser cell would contain 10 x 10 = 100 finer cells, I am not sure how to program this into R so that the values are in the right location relative to an x and a y column (for cell addresses). But I realise in this case, I would need each coarse cell value to be repeated 100 times in the data frame.
Any advice would be greatly appreciated! Thanks!
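One possible way to get each coarse value repeated for the 100 fine cells it contains is to derive a coarse cell index from the fine coordinates with integer division and then merge the two tables. The sketch below uses base R only; the grid size, the column names (school_volume, depth, cx, cy) and the values are all made up for illustration. If the layers are kept as raster objects, raster::disaggregate() with fact = 10 may achieve the same repetition, though that route is not shown here.

```r
# Hypothetical 20 m x 20 m area: fine cells are 1 m, coarse cells are 10 m
fine <- expand.grid(x = 0:19, y = 0:19)    # lower-left corner of each 1 m cell
set.seed(1)
fine$school_volume <- runif(nrow(fine))    # placeholder response values

coarse <- expand.grid(cx = 0:1, cy = 0:1)  # index of each 10 m cell
coarse$depth <- c(12, 15, 20, 18)          # placeholder bathymetry values

# Integer division assigns every fine cell to the coarse cell containing it
fine$cx <- fine$x %/% 10
fine$cy <- fine$y %/% 10

# The merge repeats each coarse value for the 100 fine cells it covers
combined <- merge(fine, coarse, by = c("cx", "cy"))
print(table(combined$depth))
```

The resulting data frame keeps one row per fine cell, with x and y as cell addresses and the coarse predictor attached to each row.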
I have a dataset containing 1599 observations and 10 attributes on which I need to do kmeans clustering. I have done the kmeans with 6 clusters and I can see the cluster centers, sizes, etc., and which observation lies in which cluster. Now I need to plot these results so that a single plot shows the following: one of the 10 attributes of my original data on the x-axis, another attribute on the y-axis, and all 1599 observations colored by the cluster they belong to (6 colors). So I will have 10C2 = 45 plots. Basically, this should tell me that cluster 1 is high/medium/low in terms of a particular attribute while cluster 2 is so and so, for all 6 clusters.
I tried the function plotcluster from fpc package but from what I understood, it maps the data into 2D, using PCA, and then plots the clusters in terms of 2 dimensions which are different from the original attributes. So now when I will say cluster 1 is low, in dim1, it wouldn't really make much sense.
Is there a function to do what I want, or should I just append the '$cluster' information from the kmeans output with my original data and try to plot taking 2 columns from my data at a time using the basic function plot()?
I suggest one solution, probably not the simplest one (with a for loop) but it seems to answer what you need:
df <- mtcars
vars <- names(df)                  # original attributes only
df$cluster <- factor(kmeans(df[vars], centers=6)$cluster)
mycomb <- combn(vars, 2)           # every pair of attributes, excluding cluster
for (xy in seq_len(ncol(mycomb))) {
  plot(x=df[[mycomb[1,xy]]],
       y=df[[mycomb[2,xy]]],
       col=as.numeric(df$cluster),
       xlab=mycomb[1,xy],
       ylab=mycomb[2,xy])
}
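To put numbers on statements such as "cluster 1 is high in this attribute", the per-cluster means can be tabulated next to the plots. This is a minimal sketch on mtcars, assuming the mean is a suitable summary; the seed is an assumption added for repeatability.

```r
set.seed(1)  # assumed seed for a repeatable example
df <- mtcars
df$cluster <- factor(kmeans(mtcars, centers = 6)$cluster)

# One row per cluster, one column per original attribute
cluster_means <- aggregate(. ~ cluster, data = df, FUN = mean)
print(round(cluster_means[, -1], 1))
```

Reading down a column shows which clusters sit high or low for that attribute.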
I am doing a project on K means clustering and I have a shopping dataset which has 17 variables and 2 million observations.
After running the K Means, I want to visualize all different combinations for the variables. For example A against B, B against C, C against D etc. Rather than doing it one by one, is there a way to plot all of them in one go?
I am using R for my coding. Could anyone please suggest the best way to visualize all these clusters? I am looking for a pattern within the dataset.
Any help would be much appreciated.
Thank you
You could simply use plot, which draws a scatterplot of every pair of columns when it is given a data frame.
For instance:
km <- kmeans(iris[,-5], centers=3)
plot(iris[,-5], col=km$cluster)
If you plot to a large enough image or PDF file (e.g. using the jpeg or pdf command) you can then zoom in to see individual graphs more easily.
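As a rough sketch of the PDF suggestion: open a large pdf device, draw the pairs plot, and close the device. The file name, page size, and sample size are arbitrary assumptions; drawing a random subset is one option for keeping the file manageable when there are millions of rows.

```r
set.seed(1)  # assumed seed
km <- kmeans(iris[, -5], centers = 3)

# For very large data, plot a random subset (sample size is an arbitrary choice)
idx <- sample(nrow(iris), 100)

out_file <- file.path(tempdir(), "cluster_pairs.pdf")  # hypothetical file name
pdf(out_file, width = 12, height = 12)                 # page size in inches
plot(iris[idx, -5], col = km$cluster[idx], pch = 20)
dev.off()
```

Because PDF is a vector format, zooming into individual panels does not lose resolution.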
I have a set of points with different y positions (A, B, C), each sharing the same x coordinate. Is it possible to cluster this set of 3 points together rather than individually?
I'd like to see the occurrence of this set of 3 points together in a given sample and see what set (A,B,C) is most frequent.
I've seen most of the clustering algorithm can cluster points for a given position (x,y) but not a set of several points for a given x coordinate.
For instance, if I have the following:
X A B C
1 0.7 0.1 0.2
2 0.3 0.4 0.1
3 0.4 0.5 0.1
4 0.7 0.1 0.2
5 0.7 0.1 0.2
6 0.2 0.1 0.5
The positions x :1, 4 and 5 should be clustered together because they have the same set (A,B,C) = (0.7,0.1,0.2).
Is there an existing algorithm or tool (in R) that does this: clustering by set of several points and finding the most frequent set, with a graphical visualization?
Any help would be much appreciated.
If you're looking to tabulate the instances, then something along the lines of:
tab <- table(sprintf("%s:%s:%s", df1$A, df1$B, df1$C))
which.max(tab)
sort(tab, decreasing=TRUE)
will give you the most frequent combination (you can use strsplit to separate out the individual components if you need to go on and use them programmatically).
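Applied to the six example rows from the question, the tabulation might look like the sketch below. The names key, top, top_values, and members are introduced here for illustration only. If the values come from calculations rather than being read from a file, rounding them before building the key may be needed so that near-identical numbers are counted together.

```r
df1 <- data.frame(
  X = 1:6,
  A = c(0.7, 0.3, 0.4, 0.7, 0.7, 0.2),
  B = c(0.1, 0.4, 0.5, 0.1, 0.1, 0.1),
  C = c(0.2, 0.1, 0.1, 0.2, 0.2, 0.5)
)

# One text key per row, then count how often each key occurs
key <- sprintf("%s:%s:%s", df1$A, df1$B, df1$C)
tab <- table(key)

top <- names(tab)[which.max(tab)]                                # "0.7:0.1:0.2"
top_values <- as.numeric(strsplit(top, ":", fixed = TRUE)[[1]])  # 0.7 0.1 0.2
members <- df1$X[key == top]                                     # 1 4 5
print(sort(tab, decreasing = TRUE))
```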
If you're looking to cluster, in the sense of find similar distances, then you can just use
dis <- dist(as.matrix(df1[, c("A","B","C")]))
clust <- hclust(dis)
and dis will tell you all the pairwise distances (find the zeros to get the identical rows), and clust will give you a tree based on similarity across A, B, and C.
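As a sketch of how the identical rows could be pulled out of the tree, cutting it at height 0 with cutree should leave only rows at distance zero in the same group. The data are the six example rows from the question; the name groups is introduced here for illustration.

```r
df1 <- data.frame(
  X = 1:6,
  A = c(0.7, 0.3, 0.4, 0.7, 0.7, 0.2),
  B = c(0.1, 0.4, 0.5, 0.1, 0.1, 0.1),
  C = c(0.2, 0.1, 0.1, 0.2, 0.2, 0.5)
)

dis <- dist(as.matrix(df1[, c("A", "B", "C")]))  # pairwise Euclidean distances
clust <- hclust(dis)                             # complete linkage by default

# Cutting at height 0 keeps together only rows that are exactly identical
groups <- cutree(clust, h = 0)
print(split(df1$X, groups))
```

Raising h groups rows that are close but not identical, which may be more useful for measured data.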
If this isn't answering the question, you probably need to clarify. You say things like same x coordinate in the text, but none of your rows have the same X value. And it's fairly unconventional to switch interchangeably between y coordinate / position / (A,B,C) .
It's hard to suggest a visualisation without knowing what feature you want to emphasize. Possibly a multi-dimensional scaling graph, where each node represents all x with the same (A,B,C) triplet, and then neighbours are other X's with closest (A', B', C') values?
I'm fairly new to R, but I am trying to create line graphs that monitor the growth of bacteria over time. I can successfully do this, but the resulting graph isn't to my satisfaction: my time increments are not evenly spaced, yet R plots them as if they were. Here is some sample data to give you an idea of what I'm talking about.
x=c(.1,.5,.6,.7,.7)
plot(x,type="o",xaxt="n",xlab="Time (hours)",ylab="Growth")
axis(1,at=1:5,lab=c(0,24,72,96,120))
As you can see, there are 48 hours between 24 and 72, but the points are evenly spaced on the graph. Is there any way I can adjust the scale to display my data more accurately?
It's always best in R to use data structures that exhibit the relationships between your data. Instead of defining growth and time as two separate vectors, use a data frame:
growth <- c(.1,.5,.6,.7,.7)
time <- c(0,24,72,96,120)
df <- data.frame(time,growth)
print(df)
  time growth
1    0    0.1
2   24    0.5
3   72    0.6
4   96    0.7
5  120    0.7
plot(df, type="o")
Not sure if this produces the exact x-axis labels that you want, but you should be free to edit the graph now without changing the relationship between the growth and time variables.
x <- data.frame(x=c(.1,.5,.6,.7,.7), y=c(0,24,72,96,120))
plot(x$y, x$x, type="o", xaxt="n", xlab="Time (hours)", ylab="Growth")
axis(1, at=x$y, labels=x$y)