Find outliers without loading big data

I am working with a big gene expression data set in R. I am not able to load all of the data and run a principal component analysis (PCA) to find outliers. Is there any way to find the outliers (or run PCA) without loading the whole data set into memory?
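One possibility, sketched below, is to compute the PCA out of core: read the file in chunks, accumulate the column sums and cross-products, and eigen-decompose the resulting covariance matrix. This is only a minimal sketch in base R; the file name expression.txt, the tab separator, and the assumptions that samples are in rows, genes are in columns, and everything after the header row is numeric are hypothetical, and it further assumes the gene-by-gene covariance matrix itself fits in memory. Packages such as bigstatsr (file-backed matrices with big_randomSVD) or irlba (truncated SVD) are alternatives worth checking.

# Sketch: out-of-core PCA by chunked reading (assumptions as described above)
chunk_size <- 1000L
con <- file("expression.txt", open = "r")
header <- readLines(con, n = 1)                  # skip the header row
p <- length(strsplit(header, "\t")[[1]])         # number of columns (genes)

n <- 0
col_sum <- numeric(p)
cross <- matrix(0, p, p)

repeat {
  chunk <- tryCatch(as.matrix(read.table(con, nrows = chunk_size, sep = "\t")),
                    error = function(e) NULL)    # read.table errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  n <- n + nrow(chunk)
  col_sum <- col_sum + colSums(chunk)
  cross <- cross + crossprod(chunk)              # running t(X) %*% X
}
close(con)

mu <- col_sum / n
covmat <- (cross - n * outer(mu, mu)) / (n - 1)  # sample covariance matrix
eig <- eigen(covmat, symmetric = TRUE)
pcs <- eig$vectors[, 1:2]                        # top two principal axes

A second pass over the file can then project each chunk onto pcs (subtract mu, multiply by pcs) and flag samples whose scores lie far from the bulk as outliers.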

Related

Data structure and package for a radial dendrogram in R

I'd like to create a radial dendrogram in R, but being new to the software, I don't know if I chose the correct data structure and package.
I've created a YAML file that looks as follows:
[image: data structure]
I know the exact hierarchy of the languages, but I need R to calculate the x and y values. I'd use hclust for that, I think?
I found an example here: https://stats.stackexchange.com/questions/4062/how-to-plot-a-fan-polar-dendrogram-in-r, but it uses the mtcars dataset. I'd just like to know whether it makes sense to set up my data as above or whether I should use a different structure. When I try to import the data I get an error message saying I've got more columns than column headers, so I must be doing something wrong.
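For what it's worth, here is a minimal sketch of the usual route: hierarchical clustering on a plain rectangular data set, then a fan-style plot via the ape package (an assumption; the linked answer follows the same route). The language names and feature columns are made up; the point is only the data structure: one row per leaf, numeric columns, no YAML needed.

library(ape)                            # for as.phylo() and the "fan" plot type

# hypothetical feature matrix: one row per language, numeric feature columns
langs <- data.frame(row.names = c("English", "German", "Dutch",
                                  "French", "Spanish", "Italian"),
                    f1 = c(1, 1, 1, 0, 0, 0),
                    f2 = c(1, 0, 1, 1, 0, 0),
                    f3 = c(0, 1, 1, 0, 1, 1))

hc <- hclust(dist(langs))               # hclust computes the layout for you
plot(as.phylo(hc), type = "fan")        # radial ("fan") dendrogram

If the hierarchy is already known exactly rather than computed from distances, building the tree directly, e.g. with ape::read.tree() on a Newick string, may fit better than hclust.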

Stacked or Grouped Barcharts from Weighted Survey Data (Class="survey.design2" "survey.design") in R

I am working with weighted survey data of the classes survey.design2 and survey.design. With the survey package and its svytable() function, I can create contingency tables for survey data. With these contingency tables, I can then create normal bar-charts using lattice. The standard approach (e.g. barchart(cars ~ mpg | factor(cyl), data=mtcars, ...)) doesn't work for this data type.
I am used to working with ggplot2 and would like to create either stacked or grouped bar-charts, if possible even with facet-wraps. Unfortunately, ggplot2 does not know how to deal with data of the type survey.design2 either. As far as I can tell, there is also no add-on that would allow ggplot2 to deal with this kind of data.
So far I have:
sub-set my data set
converted it into class survey.design2 with svydesign(),
plotted multiple bar-charts in one window using grid.arrange(). This provides a partial workaround for facetting, but still doesn't let me create stacked or grouped bar-charts.
I'd be grateful for any suggestions.
Thank you
Good morning MatthewR
I have a data set with 62732 observations and 691 variables.
[image: original data set]
So any example based on a random number generator should work just as well, I guess. I am really just interested in a workaround for this issue, not necessarily the final code.
I then convert the data frame into survey.design format using:
df_Survey <- svydesign(id=~1, weights=~IXPXHJ, data=df)
IXPXHJ is the variable by which the original sample data set is weighted so as to represent the entire population. head(df$IXPXHJ) looks something like this:
87.70876
78.51809
91.95209
94.38899
105.32005
56.30210
str(df_Survey) looks something like this.
[image: survey data structure]
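A common workaround, sketched below, is to let svytable() do the weighting and then hand the flattened table to ggplot2. This is only a sketch against hypothetical variable names (var_a, var_b, region); it is not tested against the data set above.

library(survey)
library(ggplot2)

df_Survey <- svydesign(id = ~1, weights = ~IXPXHJ, data = df)

tab <- as.data.frame(svytable(~var_a + var_b, design = df_Survey))  # weighted counts in a Freq column

ggplot(tab, aes(x = var_a, y = Freq, fill = var_b)) +
  geom_col(position = "stack")          # use position = "dodge" for grouped bars

Facetting then works on the flattened table as well: include a third variable in the svytable() formula (e.g. ~var_a + var_b + region) and add facet_wrap(~region) to the plot.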

Correlation calculation in Tableau and R

I have a huge dataset and have to calculate the correlation matrix for different indices based on the user's selection (filter). I applied the formula both in a Tableau calculated field and in R.
Both are successful if the indices have the required data, and the output results are the same. The problem arises when one of the indices I choose has less data than required (e.g., only 2 years of data available, whereas I want a 3-year correlation).
R automatically ignores indices that don't have values for the full time frame, whereas Tableau still calculates for those indices and shows a result, even if only one data point is available. Those indices shouldn't show any result, just as in R.
How can I remove those indices from the Tableau calculation? Any help in this regard would be much appreciated.
Note: it would be very easy for me to use R instead of Tableau, but our Tableau server doesn't connect to Rserve due to a technology restriction, so I am limited to Tableau calculations only.
Tableau Calculation Code:
(WINDOW_SUM(SIZE()*[ValueAcross]*[ValueDown])-WINDOW_SUM([ValueDown])*WINDOW_SUM([ValueAcross])) / (SQRT(((WINDOW_SUM(SIZE()*[ValueAcross]^2)-WINDOW_SUM([ValueAcross])^2))*(WINDOW_SUM(SIZE()*[ValueDown]^2)-WINDOW_SUM([ValueDown])^2)))
R Calculation Code:
Script_Real("cor(.arg1,.arg2, method='pearson')",([ValueAcross]),([ValueDown]))
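For comparison, here is a minimal sketch in plain R (with made-up data and column names) of the behaviour the poster wants Tableau to reproduce: drop indices that do not cover the full window before computing the correlation matrix.

set.seed(1)
prices <- data.frame(idx_a = rnorm(36),                  # 3 years of monthly data
                     idx_b = rnorm(36),
                     idx_c = c(rnorm(24), rep(NA, 12)))  # only 2 years available

min_obs <- 36                                            # required window length
enough <- colSums(!is.na(prices)) >= min_obs             # indices with full coverage

cor(prices[, enough], method = "pearson")                # idx_c is excluded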

Applying survey weights to data before compiling contingency tables in R

The sample for a survey I am analysing was not selected randomly, so I need to apply a vector of weights to make the findings representative of the population. I have used wtd.table() (from gmodels) successfully to create frequency tables, but now I want to create a contingency table to compare two categorical variables and run a chi-squared test. I'm struggling to find the right function. The svytable() function in the survey package sounds promising, but I don't see where I should supply the weight vector. I'm new to R. Could anyone explain how to use svytable(), or suggest an alternative?
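As a minimal sketch (the variable names wt, var_a and var_b are hypothetical): the weight vector is supplied once when the design object is created with svydesign(), not to svytable() itself.

library(survey)

des <- svydesign(ids = ~1, weights = ~wt, data = mydata)  # the weights go in here

svytable(~var_a + var_b, design = des)   # weighted contingency table
svychisq(~var_a + var_b, design = des)   # design-based chi-squared test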

R - 'princomp' can only be used with more units than variables

I am using R software (R Commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying to run k-means clustering and plot the result on a graph.
"'princomp' can only be used with more units than variables"
I then created a test data set of 10 rows and 10 columns which plots fine, but when I add an extra column I get the error again.
Why is this? I need to be able to plot my clusters. When I view my data set after running kmeans on it, I can see the extra results column showing which cluster each row belongs to.
Is there anything I am doing wrong? Can I get rid of this error and plot my larger sample?
Please help, been wrecking my head for a week now.
Thanks guys.
The problem is that you have more variables than sample points and the principal component analysis that is being done is failing.
In the help file for princomp it explains (read ?princomp):
‘princomp’ only handles so-called R-mode PCA, that is feature extraction of variables. If a data matrix is supplied (possibly via a formula) it is required that there are at least as many units as variables. For Q-mode PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than dimensions.
Every data point will be its own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.
Simply speaking, you can look at the problem like this:
If you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or that have at most one 1. And this is optimal, so PCA will do this! But it is not very helpful.
You can use prcomp instead of princomp.
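A small sketch of the difference (on random data): with more columns than rows, princomp() stops with the error above, while the SVD-based prcomp() still runs.

set.seed(42)
x <- matrix(rnorm(10 * 11), nrow = 10, ncol = 11)  # 10 rows, 11 columns

# princomp(x)                    # fails: more variables than units
pc <- prcomp(x, scale. = TRUE)   # works; at most 10 non-trivial components
plot(pc$x[, 1:2])                # observations on the first two PCs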
