How to read this graph in R

Sorry if this comes across as really stupid, but I did some clustering using k-means in R and plotted this graph.
Could someone please explain to me how to read this plot, or tell me the name of these types of plots so I can google further?
Thank you.
PS: for the clustering experts, can you identify any meaningful clusters in this plot?

You effectively ignored all attributes except quantity.
The result is meaningless.
K-means only works well when every axis has a comparable distribution.
Restart from the beginning, with careful preprocessing, scaling, and an understanding of your data.
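A minimal sketch of that preprocessing in R, with a made-up data frame standing in for the real attributes: scale() puts every column on mean 0 / unit variance before kmeans() is run.

# Hypothetical data: quantity dominates the raw scale
df <- data.frame(quantity = c(1, 500, 1200, 30, 850),
                 price    = c(2.5, 3.1, 2.9, 2.7, 3.0))

scaled <- scale(df)        # center each column to mean 0, unit variance

set.seed(42)               # k-means depends on random starts
km <- kmeans(scaled, centers = 2, nstart = 25)
km$cluster                 # cluster assignment per row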

Related

How to add zoom option for wordcloud in Shiny (with reproducible example)

Could you please help me add a zooming option for the wordcloud?
A reproducible example can be found at
http://shiny.rstudio.com/gallery/word-cloud.html
I tried to incorporate rbokeh and plotly but couldn't find a wordcloud-equivalent render function.
Additionally, I found ECharts2Shiny on GitHub:
https://github.com/XD-DENG/ECharts2Shiny/tree/8ac690a8039abc2334ec06f394ba97498b518e81
But incorporating ECharts is also not really convenient for zooming.
Thanks in advance,
Abi
Normalisation is required only if the predictors are not meant to be comparable on the original scaling. There's no rule that says you must normalize.
PCA is a statistical method that gives you a new linear transformation. By itself, it loses nothing. All it does is give you new principal components.
You lose information only if you choose a subset of those principal components.
Usually PCA includes centering the data as a preprocessing step.
PCA only re-expresses the data in its own axis system (the eigenvectors of its covariance matrix).
If you use all the axes, you lose no information.
Yet usually we want to apply dimensionality reduction, i.e. describe the data with fewer coordinates.
This means projecting the data onto the subspace spanned by only some of the eigenvectors.
If the number of vectors is chosen wisely, one can achieve a significant reduction in the dimensionality of the data with negligible loss of information.
The way to do so is to choose the eigenvectors whose eigenvalues sum to most of the data's variance.
PCA itself is invertible, so lossless.
But:
It is common to drop some components, which will cause a loss of information.
Numerical issues may cause a loss in precision.
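A minimal sketch of the two answers above, using prcomp() on the built-in iris data: keeping all components is lossless, and the variance table shows how much is lost when components are dropped.

x  <- as.matrix(iris[, 1:4])
pc <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance per component: pick enough components
# to cover "most of the data power".
summary(pc)$importance["Proportion of Variance", ]

# Keep the first k components and reconstruct approximately.
k     <- 2
z_hat <- pc$x[, 1:k] %*% t(pc$rotation[, 1:k])
x_hat <- sweep(z_hat, 2, pc$scale, "*")    # undo the scaling
x_hat <- sweep(x_hat, 2, pc$center, "+")   # undo the centering
max(abs(x - x_hat))  # residual error comes only from the dropped components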

Multiple regression lines to define a set of data

I am trying to use a regression model to establish a relationship between two parameters, A and B (more specifically, runtime and workload), so that I can recommend an optimal workload, say, or quantify how strongly one affects the other. I am using rlm (robust linear model) for this purpose, since it saves me the trouble of dealing with outliers beforehand.
However, rather than output one single regression line, I would like to determine a band that can confidently explain most of the points. Here is an image I took from the web; those additional red lines are what I want to determine.
This is what I had in mind:
1. Find the mean of the residuals of all the points lying above the line, then shift the original regression line by some multiple of mean + k*sigma. The same can be done for the points below the line.
2. In SVM, in order to find the support vectors, we draw parallel lines (essentially shifting the middle line until support vectors are found on either side). I had something like that in mind: play around with the intercepts a little and count the number of points explained by the band, with a threshold so you can stop somewhere.
The problem is, I am unable to implement this in R. For that matter, I am not sure these approaches even work. I would like to know what you would suggest. Also, is there a classic way to do this using one of the many R packages?
Thanks a lot for helping. Appreciate it.
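A hedged sketch of idea 1, with simulated stand-ins for A and B since the real data isn't shown: fit rlm(), then shift the fitted line by mean + k*sigma of the residuals above and below it. (A classic packaged alternative would be quantile regression, e.g. quantreg::rq(B ~ A, tau = c(0.05, 0.95)), which gives a band directly.)

library(MASS)                      # rlm()

set.seed(1)
A <- runif(100, 0, 10)             # stand-in for workload
B <- 2 * A + rnorm(100, sd = 2)    # stand-in for runtime

fit <- rlm(B ~ A)
res <- residuals(fit)
k   <- 1                           # band-width multiplier, tune as needed

up    <- res[res > 0]
down  <- res[res < 0]
upper <- mean(up)   + k * sd(up)
lower <- mean(down) - k * sd(down)

plot(A, B)
abline(fit)                                             # central robust fit
abline(coef(fit)[1] + upper, coef(fit)[2], col = "red") # upper band line
abline(coef(fit)[1] + lower, coef(fit)[2], col = "red") # lower band line

mean(res <= upper & res >= lower)  # fraction of points inside the band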

Clustering time series in R

I have a problem with clustering time series in R.
I googled a lot and found nothing that fits my problem.
I have made an STL decomposition of the time series.
The trend component is in a matrix with 64 columns, one for every series.
Now I want to cluster these series into similar groups, taking into account both the curve shapes and the time shift. I found some functions that cover one of these aspects, but not both.
First I tried to calculate a distance matrix with the DTW distance; the resulting clusters were based on the values and accounted for the time shift, but not for the shape of the time series. After this I tried some correlation-based clustering, but then the time shift was not recognized and the result did not satisfy my requirements.
Is there a function that could cover my problem, or do I have to build something on my own? I am thankful for every kind of help; after two days of tutorials and examples I am totally uninspired. I hope I could explain the problem well enough to you.
I attached a picture showing some example time series.
There you can see the problem: the two series in the middle are assigned to one cluster, although the upper series and the one on the bottom each have the same shape as one of the middle ones.
Have you tried the R package dtwclust?
https://cran.r-project.org/web/packages/dtwclust/index.html
(I'm just starting to explore this package, but it seems to cover a lot of aspects of time series clustering, and it has lots of good references.)
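A minimal sketch with dtwclust, assuming the 64 trend series are the columns of a matrix called trends (name assumed): z-normalising makes the distance shape-based, and the DTW distance absorbs the time shift.

library(dtwclust)

series <- asplit(trends, 2)          # one series per matrix column
cl <- tsclust(series,
              type     = "partitional",
              k        = 4L,         # number of clusters to try
              preproc  = zscore,     # shape-based comparison
              distance = "dtw_basic",
              centroid = "pam",
              seed     = 42)
plot(cl)        # series drawn per cluster
cl@cluster      # cluster membership for each of the 64 series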
You can use the kml package. It is designed specifically for longitudinal data. You can consult its help pages; they include the following example:
library(kml)
### Generation of some data
cld1 <- generateArtificialLongData(25)
### We suspect 3, 4 or 6 clusters; we want 3 redrawings.
### We want to "see" what happens (so toPlot = 'both')
kml(cld1, c(3, 4, 6), 3, toPlot = 'both')
### 4 seems to be the best. We want 10 more redrawings.
### We don't want to watch again; we want the result as fast as possible.
kml(cld1, 4, 10)
[Image: example cluster]

Finding patterns through better visualization in R

I have the following time series data with 60 data points, shown below, along with a simple plot of the data. I am using R for plotting. I think that if I draw a moving average curve through the points in the graph, we can better understand the patterns in the data, but I don't know how to do that in R. Could someone help me do it? Additionally, I am not sure whether this is a good way to identify patterns or not; please suggest a better way if there is one. Thank you.
x <- c(18,21,18,14,8,14,10,14,14,12,12,14,10,10,12,6,10,8,
14,10,10,6,6,4,6,2,8,6,2,6,4,4,2,8,6,6,8,12,8,8,6,6,2,2,4,
4,4,8,14,8,6,6,2,6,6,4,4,8,6,6)
To answer your question about moving averages, you could accomplish it with the help of rollmean, which is in the package zoo.
From Joshua's comment: You could also look into TTR package that depends on xts that depends on zoo. Also, there are other moving averages in the package TTR: check ?MA.
library(zoo)  # rollmean() lives in zoo
dat <- x      # the vector from the question
# sliding window / moving average of size 5
dat.k5 <- rollmean(dat, k = 5)
One reasonable possibility:
library(ggplot2)
d <- data.frame(x = x)  # or read from a file, e.g. data.frame(x = scan("tmp.dat"))
qplot(seq(nrow(d)), x, data = d) + geom_smooth(method = "loess")
With regard to "is this a good way to identify patterns" (which is a little off-topic for StackOverflow, but whatever); I think rolling means are perfectly respectable, although more sophisticated methods (such as the locally-weighted regression [loess/lowess] shown here) do exist. However, it doesn't look to me as though there is much of a complicated pattern to detect here: the data seem to initially decline with time, then level off. Rolling means and more sophisticated approaches may look prettier, but I don't think they will identify any deeper patterns in this data set ...
If you want to do this sort of thing for multiple data sets at once (as indicated in your comment), you may like ggplot's capabilities for automatically producing multi-line or faceted versions of the same plot.
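A small sketch of that faceting idea, with a made-up second series stacked in long format alongside the x vector from the question:

library(ggplot2)

d <- data.frame(t      = rep(seq_along(x), 2),
                value  = c(x, rev(x)),       # second series is made up
                series = rep(c("a", "b"), each = length(x)))

ggplot(d, aes(t, value)) +
  geom_point() +
  geom_smooth(method = "loess") +
  facet_wrap(~ series)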

Time series smoothing, avoiding revisions

This time my question is more methodological than technical. I have weekly time series data which gets updated every week. Unfortunately the time series is quite volatile, so I would like to apply a filter/smoothing method. I tried Hodrick-Prescott and LOESS. Both results look fine, with the downside that if a new data point arrives which diverges strongly from the historic data points, the older values have to be revised/are changed. Does somebody know a method implemented in R which could do what I want? The name of a method/function would probably be completely sufficient. It should however be something more sophisticated than a left-sided moving average, because I would not like to lose data at the beginning of the time series. Every helpful comment is appreciated! Thank you very much!
Best regards,
Andreas
I think (?) that the term you may be looking for is causal filtering, i.e. filtering that doesn't depend on future values. Within this category probably the simplest/best known approach is exponential smoothing, which is implemented in the forecast and expsmooth packages (library("sos"); findFn("{exponential smoothing}")).
Does that help?
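A minimal sketch with the forecast package, on a simulated weekly series (all data made up): the fitted values from ses() are one-step-ahead, so past smoothed values never get revised when new data arrive.

library(forecast)

set.seed(1)
y <- ts(cumsum(rnorm(104, sd = 0.5)) + rnorm(104), frequency = 52)

fit <- ses(y)                     # simple exponential smoothing
plot(y)
lines(fitted(fit), col = "red")   # causal: depends only on past values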
It seems you need a robust two-sided smoother. The problem is that an outlier at an end-point is indistinguishable from a sudden change in the trend. It only becomes clear that it is an outlier after several more observations are collected (and even then some strong assumptions of trend smoothness are required).
I think you will find it hard to do better than loess(), but other functions that aim to do robust smoothing include
smooth() for Tukey's running-median smoothers;
supsmu() for Friedman's super smoother.
Note that Hodrick-Prescott smoothing is not robust to outliers.
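A quick sketch comparing those smoothers on made-up data with an outlier near the end-point, to illustrate the trade-off described above:

set.seed(1)
t <- 1:100
y <- sin(t / 10) + rnorm(100, sd = 0.2)
y[97] <- 5                                 # outlier near the end-point

plot(t, y)
lines(supsmu(t, y), col = "blue")                   # Friedman's super smoother
lines(t, as.numeric(smooth(y)), col = "darkgreen")  # Tukey's running medians
lines(t, predict(loess(y ~ t, family = "symmetric")), col = "red")  # robust loess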
