RadViz & survey/permutation matrix plots in R

I'm currently working on a brief presentation for a graduate class in multivariate data analysis. It's on methods of displaying multivariate data (for human comprehension), and of the six methods we're supposed to present on, I've taken on radial visualization plots (specifically the type referred to as "RadViz") and survey plots (which are a variety of permutation matrix visualization, or so I've been led to understand from my research). While I have been able to find sufficient resources on the uses of these visualization methods, as well as their benefits/drawbacks, I'm having trouble finding code to implement them in R.
I have located two user-written functions that will do survey plots and radial visualization in R. These appear to be part of the package "dprep", which has since been removed from CRAN--and try as I might, I cannot get it to install as a package when I download the older version from the archive. Additionally, all of this code is now six years and several R versions out of date, and I am hesitant to recommend it to classmates if it may become completely unusable at some point.
I suppose what I'm asking is whether there is an easier or cleaner way--possibly as part of an existing package--to implement these visualizations in R, or if my only option is the (very old) code above. I'm aware of solutions in other programming languages (Python) as well as other pieces of software (Orange, VisuLab), but since the class is primarily based around using R, I'd like to be able to present in R if I can.

Sounds like we need to teach you to search. The Google pathway is always available, but for R functionality it is sometimes not sufficiently specific when the topic name is commonly used for other concepts. I often pair the search term with 'rproject':
https://www.google.com/search?q=radviz&ie=utf-8&oe=utf-8#q=radviz+rproject
Brings up:
http://www.cs.uml.edu/~phoffman/Radviz/readme.txt # R interface to C-implementation
... as well as many others, but it would take some effort to sort through them to find an R-specific implementation.
I have had many successes using the findFn function in the sos package:
install.packages("sos")
library(sos)
Originally I thought this was just an ordinary radar chart, but it seems to be something different.
> findFn("Radial Coordinate Visualization")
found 12 matches; retrieving 1 page
Downloaded 4 links in 3 packages.
The search on Radviz brings up only a single item, radviz2d, whose help page links to a surveyplot function in the same package, 'dprep'. The term 'radial' alone brought up a large, possibly unmanageable, number:
> findFn("radial plots")
found 456 matches; retrieving 20 pages, 400 matches.
2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
Those terms deliver a somewhat more manageable number. Radar plots or spider plots are generally used for discrete variables, whereas radial coordinate visualization appears to be a method of projecting multivariate associations onto a two-dimensional domain. The 'circular' package also deals with display and statistics for continuous circular data.
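Since the underlying projection is simple, here is a minimal base-R sketch of the RadViz idea itself (my own illustration, not code from any package): each variable gets an anchor on the unit circle, and each observation is placed at the weighted average of the anchors, with weights given by its min-max normalized values.
radviz_xy <- function(X) {
  X <- apply(X, 2, function(v) (v - min(v)) / (max(v) - min(v)))  # min-max normalize
  p <- ncol(X)
  theta <- 2 * pi * (seq_len(p) - 1) / p    # evenly spaced anchor angles
  anchors <- cbind(cos(theta), sin(theta))  # anchors on the unit circle
  W <- X / rowSums(X)                       # each row's weights sum to 1
  W %*% anchors                             # 2-D coordinates
}

xy <- radviz_xy(iris[, 1:4])
plot(xy, col = iris$Species, asp = 1, xlab = "", ylab = "",
     main = "RadViz projection of iris")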
From the CRAN archive I downloaded and unpacked version 2.1 of dprep (dprep_2.1.tar.gz):
source('~/Downloads/dprep/R/radviz2d.R', chdir = TRUE)
mmnorm <- function (data, minval = 0, maxval = 1) {
    # Min-max normalization: rescales every column except the last
    # (assumed to hold the class labels) to the range [minval, maxval].
    d <- dim(data)
    cl <- class(data)          # renamed from 'c' to avoid masking base::c
    cnames <- colnames(data)
    classes <- data[, d[2]]    # set the class column aside
    data <- data[, -d[2]]
    minvect <- apply(data, 2, min)
    maxvect <- apply(data, 2, max)
    rangevect <- maxvect - minvect
    zdata <- scale(data, center = minvect, scale = rangevect)       # now in [0, 1]
    newminvect <- rep(minval, d[2] - 1)
    newmaxvect <- rep(maxval, d[2] - 1)
    newrangevect <- newmaxvect - newminvect
    zdata2 <- scale(zdata, center = FALSE, scale = 1/newrangevect)  # stretch to the new range
    zdata3 <- zdata2 + newminvect                                   # shift to the new minimum
    zdata3 <- cbind(zdata3, classes)
    if (cl == "data.frame") zdata3 <- as.data.frame(zdata3)
    colnames(zdata3) <- cnames
    return(zdata3)
}
load("/Users/davidwinsemius/Downloads/dprep/data/my.iris.rda")
radviz2d(my.iris,"Iris")
The package also has several other functions, including surveyplot, that are written in R, so they do not need to be compiled. There is one compiled function in the package which I have not investigated.

I have released a new version of dprep. Edgar Acuna

Related

Analysing vocal similarity of little owls using warbleR in R

I am struggling a bit with an analysis I need to do. I have collected data consisting of little owl calls that were recorded along transects. I want to analyse these recordings for similarity, in order to see which recorded calls are from the same owls and which are from different owls. That way I can estimate the size of the population in my study area.
I have done a bit of research, and the package warbleR seems to be suitable for this. However, I am far from an R expert and am struggling with how to go about it. Do any of you have experience with these types of analyses, and maybe example scripts? It seems to me that I could use the function cross_correlation and maybe run a PCA; however, in the warbleR vignette I looked at, they only do this for different call types and not for the same call type from different individuals, so I am not sure whether it would work.
To be able to run analyses with warbleR you need to input the data using the "selection_table" format. Take a look at the example data lbh_selec_table to get a sense of the format:
library(warbleR)
data(lbh_selec_table)
head(lbh_selec_table)
The whole point of these objects is to tell R the time location (in seconds) of the signals you want to analyze within your sound files. Take a look at this link for more details on this object structure and how to import it into R.
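As a rough sketch of where you might go from there (assuming, as the question suggests, that cross_correlation returns a matrix of pairwise peak correlations; the parameter values here are only illustrative), one could turn the similarities into distances and ordinate them with classical MDS:
library(warbleR)

# example selections plus the sound files they point to
data(list = c("Phae.long1", "Phae.long2", "Phae.long3", "Phae.long4",
              "lbh_selec_table"))
writeWave(Phae.long1, file.path(tempdir(), "Phae.long1.wav"))
writeWave(Phae.long2, file.path(tempdir(), "Phae.long2.wav"))
writeWave(Phae.long3, file.path(tempdir(), "Phae.long3.wav"))
writeWave(Phae.long4, file.path(tempdir(), "Phae.long4.wav"))

# pairwise spectrographic cross-correlation between all selections
xc <- cross_correlation(lbh_selec_table, wl = 300, ovlp = 90, path = tempdir())

# treat 1 - correlation as a distance and ordinate in two dimensions
ord <- cmdscale(as.dist(1 - xc), k = 2)
plot(ord, xlab = "MDS 1", ylab = "MDS 2")
Calls recorded from the same individual should then fall close together in the MDS plane, though you would still need a clustering step (and caution about recording conditions) before counting individuals.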

R programming spectrum analysis

Hello, I am new to R programming in RStudio. I will be analyzing Raman spectral data in the future.
Which package would be useful for spectral data analysis? I would like to learn that package. I have attached an image of how I want to analyze the data. Please give me suggestions on how to plot a graph like the one in the figure in RStudio.
Thanks in advance.
There is a free package called hyperSpec that was specifically designed to handle spectral data together with associated extra data (e.g. experimental parameters). The package also provides an interface for common operations, like baseline correction, selection of spectral ranges, normalization, PCA, etc. Moreover, there is a host of plotting functions.
You can install it from CRAN with install.packages("hyperSpec"); however, as of today the CRAN version is outdated. I would recommend fetching the recent build from GitHub and installing it via RStudio (look for Packages -> Install -> From Package Archive File).
hyperSpec comes with extensive documentation and example datasets. To browse through the tutorials, run
browseVignettes("hyperSpec")
Plotting is as easy as
library(hyperSpec)
plot(chondro)                                  # left plot
qplotspc(chondro) + ggtitle("Example dataset") # right plot
To import your own data, look for functions inside hyperSpec whose names start with read. Just start typing hyperSpec::read and a pop-up will appear. A lot of device-specific data formats are supported. See vignette("fileio") for details.
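For instance, a minimal sketch of an import (the file name and column layout here are hypothetical; read.txt.long is one of hyperSpec's plain-text import helpers):
library(hyperSpec)
# hypothetical long-format text file with wavelength and intensity columns
spc <- read.txt.long("my_raman_measurement.txt")
plot(spc)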

Text Categorization using R

I am relatively new to using R. I have a dataset of around 5000 data points.
My goal is to predict a category using the comments entered.
I have a training dataset of 4500 records and a testing data set of 500 records.
I am looking for 2-3 packages which might help me in doing this. I have to evaluate these packages and prepare a report on them. Can anyone suggest some good packages which might be easier to use and also more efficient?
Again, I have 2 columns.
The 1st one is comments, and based on this I have to predict the category.
Right now I have defined around 10 independent categories.
Most of the comments have specific keywords which I have defined as categories.
One such example:
Comment 1
The website is pretty good --->> category would be WebsiteContent
Comment 2 might be:
Excellent article, very detailed --->> same category as above (WebsiteContent)
But keywords such as article and website are very limited and can be linked to the category.
All of the comments are different, but the underlying keywords are mostly the same.
Thanks,
Ankan
Although all you really need is a very long and well-written set of if-else statements, try using a decision tree from the rpart package, plotted with prp from the rpart.plot package. I'm suggesting this only because you're trying to learn, and I'm guessing this is for an assignment you're supposed to be doing on your own.
library(rpart)
library(rpart.plot)
tree <- rpart(decision ~ comment, data = train, method = "class")
prp(tree)
The first lines build the model and the last one plots it. This might be a bit overboard, actually, but since you're learning R it is a fun thing to work with and can be used for a wide variety of things. Decision trees do work better with more predictor variables, though.
Use predict(tree, newdata = test, type = "class") to test the model on your test dataset (note that the fitted model is the first argument to predict()).
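One caveat worth noting: rpart works on tabular features, so a free-text comment column usually needs to be converted into keyword indicators first. A hedged sketch using the tm package (the column names train$comment and train$decision are assumed from the code above):
library(tm)
library(rpart)
library(rpart.plot)

# build a document-term matrix of keyword counts from the comments
corpus <- Corpus(VectorSource(train$comment))
dtm <- DocumentTermMatrix(corpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE,
                                         stopwords = TRUE))

# combine the keyword features with the category labels
features <- as.data.frame(as.matrix(dtm))
features$decision <- train$decision

tree <- rpart(decision ~ ., data = features, method = "class")
prp(tree)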

Comparing R to Matlab for Data Mining

Instead of starting to code in Matlab, I recently started learning R, mainly because it is open source. I am currently working in the data mining and machine learning field. I found many machine learning algorithms implemented in R, and I am still exploring different packages implemented in R.
I have a quick question: how do you compare R to Matlab for data mining applications--popularity, pros and cons, industry and academic acceptance, etc.? Which one would you choose, and why?
I went through various comparisons of Matlab vs. R against various metrics, but I am specifically interested in their applicability to data mining and ML.
Since both languages are pretty new to me, I was just wondering whether R would be a good choice or not.
I appreciate any kind of suggestions.
For the past three years or so, I have used R daily, and the largest portion of that daily use is spent on machine learning/data mining problems. I was an exclusive Matlab user while in university; at the time I thought it was an excellent set of tools/platform. I am sure it is today as well. The Neural Network Toolbox, the Optimization Toolbox, the Statistics Toolbox, and the Curve Fitting Toolbox are each highly desirable (if not essential) for someone using MATLAB for ML/data mining work, yet they are all separate from the base MATLAB environment--in other words, they have to be purchased separately.
My Top 5 list for Learning ML/Data Mining in R:
Mining Association Rules in R
This refers to a couple of things: first, a group of R packages whose names all begin with arules (available from CRAN); you can find the complete list (arules, arulesViz, etc.) on the project homepage. Second, all of these packages are based on a data-mining technique known as Market Basket Analysis, or alternatively Association Rules. In many respects, this family of algorithms is the essence of data mining--exhaustively traverse large transaction databases and find above-average associations or correlations among the fields (variables or features) in those databases. In practice, you connect them to a data source and let them run overnight. The central R package in the set mentioned above is called arules; on the CRAN package page for arules, you will find links to a couple of excellent secondary sources (vignettes, in R's lexicon) on the arules package and on the Association Rules technique in general. (A short apriori example follows this list.)
The standard reference, The Elements of Statistical Learning by Hastie et al.
The most current edition of this book is available in digital form for free. Likewise, at the book's website (linked to just above) are all data sets used in ESL, available for free download. (As an aside, I have the free digital version; I also purchased the hardback version from BN.com; all of the color plots in the digital version are reproduced in the hardbound version.) ESL contains thorough introductions to at least one exemplar from most of the major ML rubrics--e.g., neural networks, SVM, KNN; unsupervised techniques (LDA, PCA, MDS, SOM, clustering); numerous flavors of regression; CART; Bayesian techniques; as well as model aggregation techniques (boosting, bagging) and model tuning (regularization). Finally, get the R package that accompanies the book from CRAN (which will save you the trouble of having to download and enter the datasets).
CRAN Task View: Machine Learning
The 3,500+ packages available for R are divided up by domain into about 30 package families, or 'Task Views'. Machine Learning is one of these families. The Machine Learning Task View contains about 50 or so packages. Some of these packages are part of the core distribution, including e1071 (a sprawling ML package that includes working code for quite a few of the usual ML categories). A one-step way to install an entire Task View is sketched at the end of this answer.
Revolution Analytics Blog
With particular focus on the posts tagged with Predictive Analytics
ML in R tutorial, comprising a slide deck and R code, by Josh Reich
A thorough study of the code would, by itself, be an excellent introduction to ML in R.
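As a taste of the arules workflow mentioned above, here is a minimal sketch using the Groceries transaction dataset that ships with the package:
library(arules)
data(Groceries)  # example transaction database included with arules

# mine association rules above minimum support/confidence thresholds
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 3))  # top three rules by lift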
And one final resource that I think is excellent, but didn't make it into the top 5: A Guide to Getting Started in Machine Learning [in R], posted at the blog A Beautiful WWW.
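On the Task View point above: the ctv package can install every package in a Task View in one step. A minimal sketch (the view name is as listed on CRAN):
install.packages("ctv")
library(ctv)
install.views("MachineLearning")  # installs all packages in the ML Task View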
Please look at the CRAN Task Views and in particular at the CRAN Task View on Machine Learning and Statistical Learning which summarises this nicely.
Both Matlab and R are good if you are doing matrix-heavy operations, because they can use highly optimized low-level code (BLAS libraries and such) for this.
However, there is more to data mining than just crunching matrices. A lot of people totally neglect the whole data-organization aspect of data mining (as opposed to, say, plain machine learning).
And once you get to data organization, R and Matlab are a pain. Try implementing an R*-tree in R or Matlab to take an O(n^2) algorithm down to O(n log n) runtime. First of all, it totally goes against the way R and Matlab are designed (use bulk math operations wherever possible); secondly, it will kill your performance. Interpreted R code, for example, seems to run at around 50% of the speed of C code (try R's built-in k-means vs. flexclust's k-means), and the BLAS libraries are optimized to an insane level, exploiting cache sizes, data alignment, and advanced CPU features. If you are adventurous, try implementing a manual matrix multiplication in R or Matlab, and benchmark it against the native one.
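For the curious, a quick sketch of that benchmark in R: a naive triple-loop multiplication against the built-in %*%, which dispatches to optimized BLAS.
n <- 200
A <- matrix(rnorm(n * n), n)
B <- matrix(rnorm(n * n), n)

# naive O(n^3) matrix multiplication in interpreted R
naive_mm <- function(A, B) {
  C <- matrix(0, nrow(A), ncol(B))
  for (i in seq_len(nrow(A)))
    for (j in seq_len(ncol(B)))
      for (l in seq_len(ncol(A)))
        C[i, j] <- C[i, j] + A[i, l] * B[l, j]
  C
}

system.time(naive_mm(A, B))  # interpreted triple loop
system.time(A %*% B)         # native BLAS call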
Don't get me wrong: there is a lot of stuff where R and Matlab are just elegant and excellent for prototyping. You can solve a lot of things in just 10 lines of code and get decent performance out of it. Writing the same thing by hand would be hundreds of lines and probably 10x slower. But sometimes you can cut out a whole level of algorithmic complexity, which for large data sets beats even the optimized matrix operations of R and Matlab.
If you want to scale up to "Hadoop size" in the long run, you will have to think about data layout and organization too, unless all you need is a linear scan over the data. But then again, you could just be sampling!
Yesterday I found two new books about data mining. These books, both entitled 'Data Mining', address the need for a comprehensive text on the subject--a need established by the progress of data mining technology and its broad popularity--by presenting in-depth descriptions of novel mining algorithms and many useful applications, along with hints and strategies for solving the problems in the following chapters. The books are "New Fundamental Technologies in Data Mining" (http://www.intechopen.com/books/show/title/new-fundamental-technologies-in-data-mining) and "Knowledge-Oriented Applications in Data Mining" (http://www.intechopen.com/books/show/title/knowledge-oriented-applications-in-data-mining). These are open-access books, so you can download them for free or just read them on an online reading platform like I do. Cheers!
We should not forget the original domains of these two pieces of software: scientific computation and signal processing led to Matlab, while statistics led to R.
I used Matlab a lot in university, since we had it installed on Unix and open to all students. However, the price of Matlab is too high, especially compared to free R. If your major focus is not matrix computation and signal processing, R should work well for your needs.
I think it also depends on which field of study you are in. I know of people in coastal research who use a lot of Matlab. Using R in such a group would make your life more difficult: if a colleague has solved a problem, you can't reuse the solution because it is written in Matlab.
I would also look at the capabilities of each when you are dealing with large amounts of data. I know that R can have problems with this, and it might be restrictive if you are used to an iterative data-mining process, for example looking at multiple models concurrently. I don't know if MATLAB has a similar data limitation.
I admit to favoring MATLAB for data mining problems, and I give some of my reasoning here:
Why MATLAB for Data Mining?
I will admit to only a passing familiarity with R/S-Plus, but I'll make the following observations:
R definitely has more of a statistical focus than MATLAB. I prefer building my own tools in MATLAB, so that I know exactly what they're doing and can customize them, but this is more of a necessity in MATLAB than it would be in R.
Code for new statistical techniques (spatial statistics, robust statistics, etc.) often appears early in S-Plus (I assume this carries over to R, at least somewhat).
Some years ago, I found S-Plus (the commercial sibling of R) to have an extremely limited capacity for data. I cannot say what the state of R/S-Plus is today, but you may want to check whether your data will fit into such tools comfortably.

Are there any R Packages for Graphs (shortest path, etc.)?

I know that R is a statistical package, but there is probably a library for working with graphs and finding the shortest path between 2 nodes.
PS: actually, I've found igraph and e1071--which one is better?
Thank you
Sure, there's a Task View that gathers a fair number of the graph-related packages. (The page linked to is a CRAN portal, which uses iframes, so I can't directly link to the Graph Task View. From the page linked to here, click on Task Views near the top of the left-hand column, then click on the Task View gR, near the bottom of the list.)
Among the packages there, igraph, for instance, has graph-theoretic functions such as those you mention in your question.
igraph versus e1071--well, igraph is coded in C, so it's very fast; I have not compared it with e1071, though.
What I do know is that these two packages differ a great deal in scope: e1071 is (at least originally) a collection of functions for a university course (I believe the unusual name 'e1071' refers to the course identifier). e1071 does contain some graph-theoretic functions, but the majority of the package's functions are directed at machine learning.
igraph, on the other hand, is a dedicated graph-theoretic package. It has many more dedicated functions, as well as constructors for a number of common graph types.
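For instance, a minimal shortest-path sketch with igraph (the toy graph and edge weights are made up for illustration):
library(igraph)

# a small weighted, undirected toy graph
g <- graph_from_literal(A - B, B - C, A - C, C - D)
E(g)$weight <- c(1, 2, 5, 1)

# the shortest path between two named vertices
shortest_paths(g, from = "A", to = "D")$vpath

# the full matrix of pairwise shortest-path distances
distances(g)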
