Peculiarity with scale() and Z-scores in R

I was attempting to scale my data in R after doing some research on the function, which seems to compute (x - mean) / std.dev. This was exactly what I was looking for, so I scaled my data frame in R. I'd also like to make sure my assumptions are correct so that I don't draw wrong conclusions.
Assumption
R scales each column independently. Therefore, column 1 will have its own mean and standard deviation. Column 2 will have its own.
Assume I have a dataset of size 100,000 and I scale 3 columns. If I proceed to remove all rows with a Z-score over 3 or under -3, then since about 0.3% of normally distributed values lie outside ±3 standard deviations, I could have up to (100,000 * 0.003 * 3 columns) = 900 rows removed!
However, when I went to truncate my data, I was left with 94,798 of my 100,000 rows, meaning 5,202 rows were removed.
Does this mean my assumption about scale was wrong, and that it doesn't scale by column?
Update
So I ran a test and did the Z-score conversion on my own. The same number of rows was removed in the end, so I believe scale does work column-wise. Now I'm just curious why more than 0.3% of the data is removed when points more than 3 standard deviations out are dropped.
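A quick experiment (with made-up data) illustrates both the assumption and the update: scale() does standardise each column independently, but removing every row that exceeds |z| > 3 in any column discards the union of the per-column outlier sets, and a skewed column inflates that count well past 0.3%.

```r
# Sketch: column-wise scaling, then row-wise outlier removal. Data are made up.
set.seed(1)
df <- data.frame(a = rnorm(100000),
                 b = rnorm(100000),
                 c = rexp(100000))   # skewed column: far more |z| > 3 values

z    <- scale(df)                          # column-wise (x - mean(x)) / sd(x)
keep <- rowSums(abs(z) <= 3) == ncol(df)   # a row survives only if ALL columns pass

colMeans(z)           # ~ 0 for every column: scaling really is per column
nrow(df) - sum(keep)  # removed rows: the union of three outlier sets,
                      # well above 100000 * 0.003
```

With a non-normal column such as the exponential one above, the 0.3% rule of thumb does not apply at all, which is one plausible source of the 5,202-row gap.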

Related

R: imputation of values in a data frame column by distribution of that variable

I have searched Stack Overflow and Google for this but have not yet found a fitting answer.
I have a data frame column with ages of individuals.
Out of around 10000 observations, 150 are NAs.
I do not want to impute those with the mean age of the whole column but assign random ages based on the distribution of the ages in my data set i.e. in this column.
How do I do that? I tried fiddling around with the MICE package but didn't make much progress.
Do you have a solution for me?
Thank you,
corkinabottle
You could simply sample 150 values from your observed (non-NA) ages:
samplevals <- sample(obs, 150, replace = TRUE)
(Here obs is the vector of non-missing ages; sampling with replacement keeps each draw proportional to the observed distribution.)
You could also stratify your observations across quantiles to increase the chances of sampling your tail values by sampling within each quantile range.
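A minimal end-to-end sketch of the sampling idea, with made-up ages standing in for the real column (the question's counts of roughly 10,000 observations and 150 NAs are mirrored here):

```r
# Fill NAs by drawing from the empirical distribution of the observed ages.
set.seed(42)
ages <- c(sample(18:90, 9850, replace = TRUE), rep(NA, 150))  # made-up data

na_idx   <- which(is.na(ages))
observed <- ages[-na_idx]
# replace = TRUE makes each imputed value follow the observed distribution
ages[na_idx] <- sample(observed, length(na_idx), replace = TRUE)

sum(is.na(ages))  # 0: every NA now holds a plausible age
```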

R: binning problem in multiple of consistent width

I have been searching for R cutting or binning packages but I could not quite find what I really want.
I have a dataset of 1000 variables; some columns have values ranging from 0.01 to 0.2, while others might range from 0 to 2000, and some might contain negative numbers.
I would like to plot a histogram for each variable, but with more consistent binning labels, i.e. I would like the bin width to be a multiple of 1, 2.5 or 5 (for decimal numbers, maybe 0.01, 0.02 or 0.05). I am flexible about the number of bins varying between 20 and 40 (they can be fixed if that's easier) and don't care much about the amount of data in each bin.
The reason for this is that I might get new data for the same variables, and I would like the binning of their distributions to be consistent, and perhaps to model results in the same bins. And there are simply too many variables to do this manually.
Any thoughts on how to write a function that returns bins consistent between the old and new data, before I get the new data?
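One way such a function could look (the name nice_breaks and the candidate-width set {1, 2, 2.5, 5} × 10^k are my own choices, not an existing package): anchor the bin edges at multiples of the width so that old and new data share the same boundaries, and return the first "nice" width that yields 20-40 bins.

```r
# Sketch: pick a bin width of 1, 2, 2.5 or 5 times a power of 10 that gives
# between min_bins and max_bins bins, with edges on multiples of the width.
nice_breaks <- function(x, min_bins = 20, max_bins = 40) {
  rng <- range(x, na.rm = TRUE)
  raw <- diff(rng) / max_bins                 # smallest width worth trying
  pow <- 10^floor(log10(raw))
  widths <- sort(outer(c(1, 2, 2.5, 5), pow * c(1, 10)))  # candidates
  for (w in widths) {
    lo <- floor(rng[1] / w) * w    # edges on multiples of the width, so any
    hi <- ceiling(rng[2] / w) * w  # new data for this variable reuses them
    n  <- round((hi - lo) / w)
    if (n >= min_bins && n <= max_bins) return(seq(lo, hi, by = w))
  }
  pretty(rng, n = min_bins)        # fallback for awkward ranges
}

set.seed(1)
x <- rnorm(1000, mean = 50, sd = 10)   # made-up variable
breaks <- nice_breaks(x)
# hist(x, breaks = breaks); keep `breaks` and reuse it for the new data
```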

Sampling according to a distribution from a large vector in R

I have a large vector of 11 billion values. The distribution of the data is not known, and therefore I would like to sample 500k data points based on the existing probabilities/distribution. In R there is a limit of 2^31 - 1 on the number of values that can be loaded into a vector, which is why I plan to do the sampling manually.
Some information about the data: The data is just integers. And many of them are repeated multiple times.
large.vec <- c(1, 2, 3, 4, 1, 1, 8, 7, 4, 1, ..., 216280)
To create the probabilities of 500k samples across the distribution I will first create the probability sequence.
prob.vec <- seq(0, 1, length.out = 500000)
Next, convert these probabilities to (integer) positions in the original sequence.
position.vec <- ceiling(prob.vec * 11034432564)
position.vec[1] <- 1  # the first probability is 0, which would not be a valid index
The reason I created the position vector is so that I can pick the data point at a specific position after I order the population data.
Now I count the occurrences of each integer value in the population and create a data frame with the integer values and their counts. I also create the interval for each of these values:
integer.values       counts  lw.interval  up.interval
             0  300,000,034            0  300,000,034
             1  169,345,364  300,000,034  469,345,398
             2  450,555,321  469,345,399  919,900,719
...
Now using the position vector, I identify which position value falls in which interval and based on that get the value of that interval.
This way I believe I have a sample of the population. I got a large chunk of the idea from this reference:
Calculate quantiles for large data.
I wanted to know whether there is a better approach, or whether this approach could reasonably, albeit crudely, give me a good sample of the population?
This process does take a considerable amount of time, as the position vector has to be checked against all possible intervals in the data frame. For that I have made it parallel using RHIPE.
I understand that I will be able to do this only because the data can be ordered.
I am not trying to randomly sample here; I am trying to "sample" the data while keeping the underlying distribution intact, mainly to reduce 11 billion values to 500k.
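For what it's worth, once the per-integer counts table exists, base R's sample() can draw a distribution-preserving sample directly via its prob argument, with no interval bookkeeping at all (the counts below are the made-up ones from the table above):

```r
# Draw 500k values whose frequencies match the population counts.
counts <- data.frame(value = c(0, 1, 2),                      # toy counts
                     n     = c(300000034, 169345364, 450555321))

set.seed(1)
samp <- sample(counts$value, size = 500000,
               replace = TRUE, prob = counts$n)  # prob is normalised internally

prop.table(table(samp))  # close to counts$n / sum(counts$n)
```

This sidesteps ordering the 11 billion values entirely; only one counting pass over the data is needed.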

How to group data to minimize the variance while preserving the order of the data in R

I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4 or 5 groups based on Price so as to minimize the variance in each group, while preserving the order of Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function; however, it does not take the order of Size into consideration.
Basically, I have data similar to the following (with more data):
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
mydata <- data.frame(Size, Price)
I would like to group the data to minimize the variance of Price in each group while respecting: 1) the Size values: for example, the first two prices, 90 and 100, cannot be in different groups since they have the same Size; and 2) the order of Size: for example, if Group One includes observations 1-2 and Group Two includes observations 3-9, observation 10 can only enter Group Two or Group Three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% operator can be read as "then do", so you can chain multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
mydata <- data.frame(Size, Price) %>% # "then"
  group_by(Size)                      # group data by Size column
mydata_mean_sd <- mydata %>%          # "then"
  summarise(mean = mean(Price), sd = sd(Price)) # grouped mean and sd,
                                                # for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
Not an elegant solution, but I wrote my own function that first splits a sorted series at specified break points, then calculates the sum(SDCM) for those break points (using the algorithm underlying the Jenks approach from Wikipedia).
I then just iterated through all valid combinations of break points and selected the set of points that produced the minimum sum(SDCM).
This would quickly become unmanageable as the number of possible break-point combinations increases, but it worked for my data set.
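In the same spirit, a small brute-force sketch for the question's example (variable and function names are my own; with only a handful of distinct Size values this is cheap, but the combinations grow quickly for larger data):

```r
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size  <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)

block    <- as.integer(factor(Size))          # rows with equal Size stay together
n_blocks <- max(block)
sdcm     <- function(x) sum((x - mean(x))^2)  # within-group squared deviations

k <- 4                                        # desired number of groups
best_total <- Inf
best_grp   <- NULL
# try every way to place k - 1 cuts between consecutive Size blocks
for (cuts in combn(seq_len(n_blocks - 1), k - 1, simplify = FALSE)) {
  grp   <- cut(block, breaks = c(0, cuts, n_blocks), labels = FALSE)
  total <- sum(tapply(Price, grp, sdcm))
  if (total < best_total) { best_total <- total; best_grp <- grp }
}

best_grp  # one group label per row; the order of Size is preserved
```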

How to calculate Zscore in R

I have log2 ratio values for each chromosome position (137,221 coordinates) for different samples (15 samples). I want to calculate the Z-score of the log2 ratio for each chromosome position (row). I also want to exclude the first three columns because they contain IDs. There are also some NAs among the values.
Thank you in advance.
It's not completely clear what you want. If you want a single Z-score per row (its mean divided by its standard error) computed over all but the first three columns, then
f <- function(x) {
  n <- length(na.omit(x))
  mean(x, na.rm = TRUE) / (sd(x, na.rm = TRUE) / sqrt(n)) # mean / standard error
}
apply(as.matrix(df[, -(1:3)]), 1, f)
will do it. That gives you a vector with one value per row.
If you want every entry converted to a Z-score within its own row, then
t(scale(t(as.matrix(df[, -(1:3)]))))
should work. If neither of those works, you need to post a reproducible example -- or at least tell us precisely what the error messages are.
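For completeness, a reproducible sketch of the row-wise case with ID columns and NAs (the data frame here is made up to mirror the question's shape):

```r
set.seed(1)
df <- data.frame(chr = rep("chr1", 6), start = 1:6, end = 2:7,  # ID columns
                 s1 = rnorm(6), s2 = rnorm(6), s3 = rnorm(6),
                 s4 = c(NA, rnorm(5)))                          # one NA value

m <- as.matrix(df[, -(1:3)])   # drop the first three (ID) columns, keep all rows
z <- t(scale(t(m)))            # scale() works column-wise, so transpose to get
                               # a Z-score per entry within its own row;
                               # NAs are ignored in the mean/sd and carried through

rowMeans(z, na.rm = TRUE)      # ~ 0 for every row
```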
