I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a google search, came up with a dazzling range of packages. Some apply to 1D signals (time series mostly, e.g. How do I run a high pass or low pass filter on data points in R? ); some apply to images. However I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go this way as they typically involve (i) installation of more or less complex/heavy solutions (imageMagick...), and/or (ii) conversion from matrix to image.
Here is sample data:
r<-seq(0:360)/360*(2*pi)
x<-cos(r)
y<-sin(r)
z<-outer(x,y,"*")
noise<-0.3*matrix(runif(length(x)*length(y)),nrow=length(x))
zz<-z+noise
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience on signal processing !
Thanks.
One option may be using a non-linear prediction method and getting the fitted values from the model.
For example by using a polynomial regression, we can predict the original data as the purple one,
By following the same logic, you can do the same thing to all columns of the zz matrix as,
predictions <- matrix(, nrow = 361, ncol = 0)
for(i in 1:ncol(zz)) {
pred <- as.matrix(fitted(lm(zz[,i]~poly(1:nrow(zz),2,raw=TRUE))))
predictions <- cbind(predictions,pred)
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that, I used a polynomial regression with degree 2, you can change the degree for a better fitting across the columns. Or maybe, you can use some other powerful non-linear prediction methods (maybe SVM, ANN etc.) to get a more accurate model.
Related
I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust works on transposed tables), then I ran pvclust on the data selecting Jacquard distances (“binary” in R) as a distance measure (suitable for species pres/abs data) and Ward’s method (“ward.D2”). I used “parallel = TRUE” to reduce computation time. However, using a default of nboots= 1000, my computer was not able to finish the computation in hours and finally I got ann error, so I tried with lower nboots (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues here seems to be the size itself of the dataset. However, I am providing the lines of code I used for the transposition, clustering and plotting:
tdata <- t(data)
cluster <- pvclust(tdata, method.hclust="ward.D2", method.dist="binary",
nboot=100, parallel=TRUE)
plot(cluster, labels=FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values for the higher ramifications of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions would be
is there anything I got wrong in the pvclust function itself?
may my low nboots (due to “weak” computer) be a reason for the non-significance of my results?
are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
.............
I have tried to run the same code on a subset of 500 records with nboots = 1000. This worked in a reasonable computation time, but the output is still not very satisfying - see dendrogram2 .dendrogram obtained for a SUBSET of 500 records and nboots=1000
I want to add noise to a dataset. This is a fairly straightforward procedure in R. I sample from a Laplace distribution and then add/multiply/whatever that vector to the vector I want to add noise to.
The issue is, my colleague is asking for the code in SAS. I have not used SAS since graduate school and my project has been put on hold until I can get my colleague up to speed in SAS.
My code is pretty simple :
library ("rmutil")
vector <- c (1,2,3,1,2,3,1,2,3)
vector_prop <- vector/sum(vector)
noise <- rlaplace(9, m=1, s=.1)
new_vector <- vector_prop * noise
I am turning my vector I want to add noise to into a proportion, then drawing from a laplace distribution. Finally I multiply those draws with my proportion vector.
Any idea would be helpful as the SAS documentation was difficult to follow. I imagine they feel the same way with R documentation.
Assuming your data is in a data set called have with a variable called vector_prop the following code is likely correct. Because of the nature of random numbers and streams you can't replicate that though, don't you end up with a different data set each time?
data want;
set have;
call streaminit(24); *fixes random number stream for reproduciblilty;
new_var = vectorProp * rand('laplace', 1, 0.1);
run;
In the hts package, grouped time series structures are created using gts. That function calls another function named, CreateGmat, where a 2-combination
cn <- combn(1L:total.len, 2)
is calculated and used to guide the construction of the group structure which is represented by a "g-matrix". For smaller structures with fewer categories, using only a 2-combination to create the structure makes intuitive sense, but I wondering if this should always be the case.
I haven't gone through all the code, yet, but my intuition is the G-matrix determines the S-matrix which affects the Reconciliation matrix which affects the forecasts that are calculated using the Optimal Combination method.
Unless my understanding of how the reconciliation process occurs is wrong, how the G-matrix is formulated is at least somewhat important to how the forecasts are calculated. For large structures with many categories, would including combinations greater than two affect the forecasts in any significantly positive way? Is there a statistical or experimental reason for using only a 2-combination?
Your understanding about G-matrix is absolutely correct. If there are more than 3 grouping variables, we need to create a G-matrix with 2-way and 3-way interactions. I had some code to automate more than 3-way combinations for a project last year, but never ended up putting it to the hts package.
nvars <- length(chr_key)
possible_len <- seq_len(nvars - 1)
list_comb <- lapply(possible_len, function(x) combn(seq_len(nvars), x,
simplify = FALSE))
cs_comb <- c(0, cumsum(choose(nvars, possible_len)))
len_comb <- cs_comb[nvars]
Im using R to create logistic regression classifier model.
Here is the code sample:
library(ROCR)
DATA_SET <- read.csv('E:/1.csv')
classOneCount= 4000
classZeroCount = 4000
sample.churn <- sample(which(DATA_SET$Class==1),classOneCount)
sample.nochurn <- sample(which(DATA_SET$Class==0),classZeroCount )
train.set <- DATA_SET[c(sample.churn,sample.nochurn),]
test.set <- DATA_SET[c(-sample.churn,-sample.nochurn),]
full.logit <- glm(Class~., data = train.set, family = binomial)
And it works fine, but I would like to preprocess the data to see if it improves classification model.
What I would like to do would be to divide input vector variables which are continuoes into intervals. Lets say that one variable is height in centimeters in float.
Sample values of height:
183.23
173.43
163.53
153.63
193.27
and so on, and I would like to split it into lets say 3 different intervals: small, medium, large.
And do it with all variables from my set - there are 32 variables.
What's more I would like to see at the end correlation between value of the variables (this intervals) and classification result class.
Is this clear?
Thank you very much in advance
The classification model creates some decision boundary and existing algorithms are rather good at estimating it. Let's assume that you have one variable - height - and linear decision boundary. Your algorithm can then decide between what values put decision boundary by estimating error on training set. If you perform quantization and create few intervals your algorithm have fewer places to put boundary(data loss). It will likely perform worse on such cropped dataset than on original one. It could help if your learning algorithm is suffering from high variance (is overfitting data) but then you could also try getting more training examples, use smaller set (subset) of features or use algorithm with regularization and increase regularization parameter
There are also many questions about how to choose number of intervals and how to divide data into them like: should all intervals be equally frequent or of equal width or most similar to each other inside each interval?
If you want just to experiment use some software like f.e. free version of RapidMiner Studio (it can read CSV and Excel files and have some quick quantization options) to convert your data
I need to run clustering on the correlations of data row vectors, that is, instead of using individual variables as clustering predictor variables, I intend to use the correlations between the vector of variables between data rows.
Is there a function in R that does vector-based clustering. If not and I need to do it manually, what is the right data format to feed in a function such as cmeans or kmeans?
Say, I have m variables and n data rows, the m variables constitute one vector for each data row. so I have a n X n matrix for correlation or cosine. Can this matrix be plugged in the clustering function directly or certain processing is required?
Many thanks.
You can transform your correlation matrix into a dissimilarity matrix,
for instance 1-cor(x) (or 2-cor(x) or 1-abs(cor(x))).
# Sample data
n <- 200
k <- 10
x <- matrix( rnorm(n*k), nr=k )
x <- x * row(x) # 10 dimensions, with less information in some of them
# Clustering
library(cluster)
r <- pam(1-cor(x), diss=TRUE, k=5)
# Check the results
plot(prcomp(t(x))$x[,1:2], col=r$clustering, pch=16, cex=3)
R clustering is often a bit limited. This is a design limitation of R, since it heavily relies on low-level C code for performance. The fast kmeans implementation included with R is an example of such a low-level code, that in turn is tied to using Euclidean distance.
There are a dozen of extensions and alternatives available in the community around R. There are PAM, CLARA and CLARANS for example. They aren't exactly k-means, but closely related. There should be a "spherical k-means" somewhere, that is sensible for cosine distance. There is the whole family of hierarchical clusterings (which scale rather badly - usually O(n^3), with O(n^2) in a few exceptions - but are very easy to understand conceptually).
If you want to explore some more clustering options, have a look at ELKI, it should allow clustering (with various methods, including k-means) by correlation based distances (and it also includes such distance functions). It's not R, though, but Java. So if you are bound to using R, it won't work for you.