dealing with data table with redundant rows - r

The title is not stated precisely, but I could not come up with better words to summarize what exactly I am going to ask.
I have a table of the following form:
value (0<v<1)    # of events
0.5677               100000
0.5688                 5000
0.1111                 6000
...                     ...
0.5688               200000
0.1111                35000
Here are some of the things I would like to do with this table: draw the histogram, compute the mean value, fit the distribution, etc. So far, I could only figure out how to do this with vectors like
v=(0.5677,...,0.5688,...,0.1111,...)
but not with tables.
Since the number of possible values is huge (they are almost continuous), I guess expanding the data into a new table would not be very efficient, so doing this without modifying the original table or creating another one would be very desirable. But if it has to be done that way, that's okay. Thanks in advance.
Appendix: What I want to figure out is how to treat this table as a usual data vector:
If I had the following vector representing the exact same data as above:
v = (0.5677, ..., 0.5677,  0.5688, ..., 0.5688,  0.1111, ..., 0.1111)
     (100000 times)        (5000+200000 times)   (6000+35000 times)
then I would just need to apply basic functions like plot, mean, etc. to get what I want. I hope this makes my question clearer.

Your data consist of a value and a count for that value so you are looking for functions that will use the count to weight the value. Type ?weighted.mean to get information on a function that will compute the mean for weighted (grouped) data. For density plots, you want to use the weights= argument in the density() function. For the histogram, you just need to use cut() to combine values into a small number of groups and then use aggregate() to sum the counts for all the values in the group. You will find a variety of weighted statistical measures in package Hmisc (wtd.mean, wtd.var, wtd.quantile, etc).
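A minimal sketch of that approach, assuming the table has been read into a data frame d with columns value and count (the names d, value, and count are illustrative, not from the original post):
d <- data.frame(value = c(0.5677, 0.5688, 0.1111, 0.5688, 0.1111),
                count = c(100000, 5000, 6000, 200000, 35000))
# weighted mean of the values, using the event counts as weights
weighted.mean(d$value, w = d$count)
# weighted density plot (the weights passed to density() should sum to 1)
plot(density(d$value, weights = d$count / sum(d$count)))
# histogram-style summary: bin the values with cut(), then sum the counts per bin
bins <- cut(d$value, breaks = seq(0, 1, by = 0.1))
aggregate(d$count, by = list(bin = bins), FUN = sum)
# further weighted measures from the Hmisc package
# library(Hmisc); wtd.mean(d$value, d$count); wtd.var(d$value, d$count); wtd.quantile(d$value, weights = d$count)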

Related

Give different color distribution for different columns in a data.frame

I tried to build a heat-map for the cluster result of my data.frame. My data.frame has 5 columns with corresponding row names. I want to know whether I can set the color distribution separately for the different columns, since the ranges of my 5 variables are so different that, if I don't scale them, the result from the "pheatmap" function in R is a heat-map with only one or two colors. And I really don't want to scale the data, since I need the positive or negative sign of each data point to remain what it should be. Here's the head of my data.frame, with the row names omitted.
r.Square_gamma_logLink  cof_glm.gamma_logLink  int_glm.gamma_logLink  estimated_shape_logLink  estimated_dispersion_logLink
0.2524970                0.002357581            8.685446               3.558583                 0.2810107
0.5932941                0.002651972            9.486916               8.085618                 0.1236764
0.3615135               -0.001646538           10.071672               6.195176                 0.1614159
0.4131553               -0.002218262           10.563557               8.671028                 0.1153266
0.3529775               -0.002336544           10.984005               4.569396                 0.2188473
0.4169932                0.002213259            9.602592               5.216084                 0.1917147
I did try the pheatmap and heatmap functions, which were not quite useful, and the result looks pretty much like this.
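For reference, a minimal sketch of the kind of call described above, assuming the data.frame shown is stored as df (the name is illustrative); pheatmap expects a numeric matrix, and its scale= argument controls per-row or per-column standardization:
library(pheatmap)
m <- as.matrix(df)            # df is the data.frame shown above
pheatmap(m, scale = "none")   # unscaled: the large-valued columns dominate the palette
pheatmap(m, scale = "column") # per-column z-scores: comparable colors, but the original signs are no longer reflected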

How to create contingency table with multiple criteria subpopulation from weighted data using svyby in the survey package?

I am working with a large federal dataset with thousands of observations and thousands of variables. Replicate weights are provided. I am using the "survey" package in R to apply these weights:
els.weighted <- svrepdesign(data = els, repweights = ~els$F3F1PNLWT,
                            combined.weights = TRUE)
I am interested in some categorical descriptive characteristics of a subset of the population, such as family living arrangements. I want to get these sorted into a contingency table that shows frequencies. I would like to sort people based on four variables (none of which are binary, but all of which are numeric). This is what I would like to get:
[image: the desired contingency table, with the outcomes of F1COMP as columns]
The blank boxes are where the cross-tabulation/frequency counts would show. (I only put in 3 columns beneath F1COMP for brevity's sake, but it has 9 outcomes – indexed 1-9)
My current code: svyby(~F1FCOMP, ~F1RTRCC + BYS33C + F1A10 + byurban, els.weighted, svytotal)
This code does sort the data, but it sorts every single combination, by default. I want them pared down to represent only specific subpopulations of each variable. I tried:
svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C==1 +F1A10==2 | F1A10==3 +byurban==3, els.weighted, svytotal)
But got stopped:
Error: unexpected '==' in "svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C=="
Additionally, my current version of the code tells me how many cases occur for each combination. This is a picture of what my current output looks like; there are hundreds more rows, one for each combination, when I keep scrolling down.
You can see in that picture that I only get one number for F1FCOMP per row – the number of cases that fit the specified combination, i.e. a specific subpopulation. I want to know more about that subpopulation. That is, F1COMP has nine different outcomes (indexed 1-9), and I want to see how many members of each subpopulation fall into each of the 9 outcomes of F1COMP.
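One hedged sketch of a possible direction (not a verified answer): the == comparisons cannot appear inside the formula, but the design can be restricted with subset() first and then tabulated; the conditions below simply mirror the ones attempted above:
library(survey)
# keep only the subpopulation of interest
els.sub <- subset(els.weighted,
                  (F1RTRCC == 2 | F1RTRCC == 3) &
                  BYS33C == 1 &
                  (F1A10 == 2 | F1A10 == 3) &
                  byurban == 3)
# weighted counts of the nine F1FCOMP outcomes within that subpopulation
svytable(~F1FCOMP, design = els.sub)
# or cross-tabulated against another variable
svyby(~factor(F1FCOMP), ~byurban, els.sub, svytotal)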

R: how to divide a vector of values into fixed number of groups, based on smallest distance?

I think I have a rather simple problem, but I can't figure out the best approach. I have a vector with 30 different values. Now I need to divide the vector into 10 groups in such a way that the mean within-group variance is as small as possible. The size of the groups is not important; it can be anything between one and 21.
Example: let's say I have a vector of six values that I have to split into three groups:
Myvector <- c(0.88,0.79,0.78,0.62,0.60,0.58)
Obviously the solution would be:
Group1 <-c(0.88)
Group2 <-c(0.79,0.78)
Group3 <-c(0.62,0.60,0.58)
Is there a function that gives the same outcome as the example and that I can use for my vector with 30 values?
Many thanks in advance.
It sounds like you want to do k-means clustering. Something like this would work:
kmeans(Myvector, 3, algorithm = "Lloyd")
Note that I changed the default algorithm to match your desired output. If you read the ?kmeans help page you will see that there are different algorithms for calculating the clusters, because it's not a trivial computational problem. They might not necessarily guarantee optimality.
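A minimal sketch of running that call on the example vector and recovering the groups (the cluster labels kmeans assigns are arbitrary, so the group numbering may differ):
Myvector <- c(0.88, 0.79, 0.78, 0.62, 0.60, 0.58)
fit <- kmeans(Myvector, centers = 3, algorithm = "Lloyd")
# split the original values by their cluster assignment
split(Myvector, fit$cluster)
# e.g. one group {0.88}, one group {0.79, 0.78}, one group {0.62, 0.60, 0.58}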

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between them. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0,1]. The file is approximately 50 GB. The pairs where the score is 1 have been removed, as more than half of the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with your data set's size. Plus, implementations usually need more than one copy of the matrix, so you may need 1 TB of RAM: 2*8*250000*250000 bytes is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
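The 1 TB figure is just the back-of-the-envelope arithmetic above; a quick check (two copies of the matrix and 8 bytes per double are the answer's assumptions):
n <- 250000
bytes <- 2 * 8 * n * n   # two copies of a dense n-by-n matrix of doubles
bytes / 1e12             # = 1, i.e. roughly 1 TB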

How to extract Mean Square of each group of entry?

Sorry, I am very weak in using R but very interested in it!
Description of my data: I have raw data collected from a lattice design (4 reps, 44 blocks, 5 plots per block). 220 entries were used; they are classified into three groups (FS = 200 entries, PC = 6 entries, and TC = 14 entries).
I would like to get the simple mean and the mean square of each group (FS, PC, and TC), and the mean square of the error.
Looking forward to your kind help,
Thanks
I think you could go a long way with the aggregate function, like
aggregate(Data$Values, list(Data$Groups), FUN=mean)
for your mean etc.
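A minimal sketch extending that idea, assuming a data frame Data with a numeric column Values and a grouping factor Groups (FS/PC/TC); the names follow the answer above and are illustrative:
# per-group means
aggregate(Values ~ Groups, data = Data, FUN = mean)
# mean squares from a one-way ANOVA: the Groups row of the table gives the group mean square
# and the Residuals row gives the mean square of the error
fit <- aov(Values ~ Groups, data = Data)
summary(fit)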
