Converting Data into two separate groups - r

I have simulated data for two groups drawn from a multivariate normal distribution in R, as shown below:
#Package to generate a multivariate normal distribution
library(mvtnorm)
#The number of simulated variables that can be changed
p=5
set.seed(30)
#Generating the eigenvalues from a uniform distribution.
m=p
eigval <- runif(m,0.25,1)
#Generating a symmetric positive-definite matrix (this will be used as the covariance matrix for the data generation).
#Ravi Varadhan (2008)
shat <- matrix(ncol=m, rnorm(m^2))
decomp <- qr(shat)
Q <- qr.Q(decomp)
R <- qr.R(decomp)
d <- diag(R)
ph <- d/abs(d)
O <- Q%*%diag(ph)
shat <- t(O)%*%diag(eigval)%*%(O)
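#Optional sanity check (added, not part of the original recipe): O is orthogonal,
#so shat should come out symmetric with eigenvalues equal to eigval.
isSymmetric(shat)
all.equal(sort(eigen(shat)$values), sort(eigval))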
#Variance-covariance matrix for the data generation.
sig <- shat
#Mean vectors for two groups where the parameters may be changed accordingly.
m1 <- runif(p,0.1,0.2)
m2 <- runif(p,0.4,0.9)
#Euclidean distance between two groups
dist(rbind(m1,m2), method = "euclidean")
#The number of observations from group1
n1 <- 30
#The number of observation from group2
n2 <- 70
#The total number of observations
n <- n1+n2
#Group Identifier where '1' represent group 1 and '2' represent group 2
G1 <- rep(1,n1)
G2 <- rep(2,n2)
G <- c(G1,G2)
#Generate data for each group
library(mvtnorm)
g1 <- rmvnorm(n=n1, mean=m1, sigma=sig)
g2 <- rmvnorm(n=n2, mean=m2, sigma=sig)
g <-rbind(g1,g2)
Data <- data.frame(G, DV1=g[ , 1], DV2=g[ , 2], DV3=g[ ,3], DV4=g[,4], DV5=g[ ,5])
Now I want to apply the qda function (from MASS) to this simulated data, using the example code found online here:
https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/qda.html
However, in that example the built-in iris data are used in the form of iris3, a 3-dimensional array of size 50 by 4 by 3 (the same data as represented in S-PLUS; see https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html).
Can someone tell me how any data can be split into an n x m x p array like that?

Not certain if you want an answer to your code or to the question about iris3. I'll talk about the latter for a moment.
The fact that it is a tidy array with 3 dimensions is a convenience, for demonstration. It works because Edgar Anderson harvested exactly 50 samples of each species. Nothing in the immediate documentation suggests there is a meaningful pairing between the first setosa and the first virginica, so the data are not paired. Unfortunately, arranging the species as planes of a cube suggests exactly that paired relationship.
Consider this: had Edgar sampled 51 setosa but kept the other two species at 50, how would the array look? One of the planes would be a little taller than the other two, so the result would no longer be a rectangular array. And what if he had recorded the 50 setosa in a different order (nothing says the order matters)? The array would be different, and any analysis that looks along the 3rd margin (e.g. iris3[1,1,]) would return different results, even though the actual data hadn't changed.
So I believe the data sit in a perfectly-arranged 3-D array for the purpose of demonstrating how to deal with multi-dimensional data, not because the data actually belong in that orientation.
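For reference, iris3 (which ships with R) is exactly that convenience object:
dim(iris3)
# [1] 50  4  3
iris3[1, , ]  # the "first" flower of each species -- an implied pairing that isn't real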
EDIT
Given that you want to know how to transform (any) data from a 2D to a 3D array, here's an example using iris. This makes a few assumptions:
1. All of the data must be of the same class. In the example below I remove the $Species column: an array requires everything to be of a single class, so if I did not remove it, all of the numbers would be converted to character, which is probably not what you want.
2. The pairing within the added dimension is actually relevant, as discussed above. The process works just fine if the data are not paired, but in that case it is perfectly reasonable for other data to have different counts in the different categories.
3. Similar (and tied) to #2, all categories must have the same amount of data. This can be waved away if you are willing to pad the shorter categories with rows of NA, but that seems a bit sloppy to me.
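For completeness, on that third point: if you really did want to force unequal groups into one array, a sketch of the NA-padding idea might look like the following (grps here is a hypothetical list of numeric matrices with the same columns but different row counts; I still would not recommend this).
pad_to <- function(m, n) rbind(m, matrix(NA, nrow = n - nrow(m), ncol = ncol(m)))
nmax <- max(sapply(grps, nrow))
arr  <- array(NA, dim = c(nmax, ncol(grps[[1]]), length(grps)))
for (i in seq_along(grps)) arr[, , i] <- pad_to(as.matrix(grps[[i]]), nmax)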
Base R
First, we split the current 2D data into groups; conveniently (and necessarily, for this to work) each element has the same dimensions (50 x 4). The -5 removes the fifth column, $Species, so that the as.matrix call in the next step will not convert numeric to character.
irislist <- by(iris, iris$Species, `[`, -5)
Pre-populate a 3D array per the dimensions of the source data.
mtx <- array(NA, dim = c(dim(irislist[[1]]), length(irislist)))
This might be done with one of the *apply functions, but I couldn't get it to work generically. Perhaps somebody can comment with a suggestion.
for (i in seq_along(irislist)) mtx[,,i] <- as.matrix(irislist[[i]])
The 3D array is made! It might be nice to add dimension names, though that is not strictly required:
dimnames(mtx) <- list(NULL, colnames(irislist[[1]]), names(irislist))
mtx
# , , setosa
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 5.1 3.5 1.4 0.2
# [2,] 4.9 3.0 1.4 0.2
# [3,] 4.7 3.2 1.3 0.2
# [4,] 4.6 3.1 1.5 0.2
# [5,] 5.0 3.6 1.4 0.2
# ...snip...
abind
This can also be done with the abind package, without the need to pre-allocate mtx, go through a for loop, or do any dimension naming:
library(abind)
mtx2 <- do.call("abind", c(irislist, list(along = 3)))
str(mtx)
# num [1:50, 1:4, 1:3] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# - attr(*, "dimnames")=List of 3
# ..$ : NULL
# ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
# ..$ : chr [1:3] "setosa" "versicolor" "virginica"
Wrap-Up
It isn't obvious how this would work with your data. When I ran your code, I ended up with six columns, only one of which (Data$G) would seem to be something you could split into another dimension (i.e., it looks like it could be categorical). Unfortunately:
table(Data$G)
# 1 2
# 30 70
and per my third assumption above, this doesn't work: the two groups have different sizes (30 vs 70).
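As a closing note on your original goal: MASS::qda does not actually require a 3-D array at all; the array in the help-page example is just how that particular demo stores iris. qda accepts a formula plus a data frame (or a matrix plus a grouping factor), so a minimal sketch with the Data frame you built would be:
library(MASS)
Data$G <- factor(Data$G)                       # qda wants a factor grouping variable
fit  <- qda(G ~ DV1 + DV2 + DV3 + DV4 + DV5, data = Data)
pred <- predict(fit, Data)                     # in-sample prediction, just to illustrate
table(observed = Data$G, predicted = pred$class)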

Related

Best way to subset RNAseq dataset for comparison in R

I have a single-cell RNAseq dataset that I have been using R to analyze. So I have a data frame with 205 columns and 15000 rows. Each column is a cell and each row is a gene.
I have an annotation matrix that has the identity of each cell. For example, patient ID, disease status, etc...
I want to do different comparisons based on the grouping info provided by the annotation matrix.
I know that in python, you can create a dictionary that is attached to the cell IDs.
What is an efficient way in R to perform subsetting of the same dataset in different ways?
So far what I have been doing is:
EC_index <-subset(annotation_index_LN, conditions == "EC_LN")
CP_index <-subset(annotation_index_LN, conditions =="CP_LN")
CD69pos <-subset(annotation_index_LN, CD69 == 100)
EC_CD69pos <- subset(EC_index, CD69 == 100)
EC_CD69pos <- subset(EC_CD69pos, id %in% colnames(manual_normalized))
CP_CD69pos <- subset(CP_index, CD69 == 100)
CP_CD69pos <- subset(CP_CD69pos, id %in% colnames(manual_normalized))
This probably won't entirely answer your question, but I think that even before you begin trying to subset your data etc. you might want to think about converting this into a SummarizedExperiment. This is a type of object that can hold annotation data for features and samples and will keep everything properly referenced if you decide to subset samples, remove rows, etc. This type of object is commonly implemented by packages hosted on Bioconductor. They have loads of tutorials on various genomics pipelines, and I'm sure you can find more detailed information there.
http://bioconductor.org/help/course-materials/
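For what it's worth, a rough sketch of what that could look like with the object names from your question (this assumes the rows of annotation_index_LN line up one-to-one with the columns of manual_normalized, and that the annotation has columns such as conditions and CD69):
library(SummarizedExperiment)
# rows = genes, columns = cells; cell-level annotation goes into colData
se <- SummarizedExperiment(
  assays  = list(counts = as.matrix(manual_normalized)),
  colData = annotation_index_LN
)
# subsetting the object keeps the assay and the annotation in sync
se_EC_CD69pos <- se[, se$conditions == "EC_LN" & se$CD69 == 100]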
The following uses the iris data, since you haven't given a minimal example of your data.
For this you need an R package that provides %>%: magrittr (it is also re-exported by dplyr). The str_match call below additionally needs stringr.
If you have to do a lot of subsetting, put the following in a function to which you pass the arguments for subset.
library(magrittr)  # provides %>%
library(stringr)   # provides str_match, used in the select below
iris %>%
subset(Species == "setosa" & Petal.Width == 0.2 & Petal.Length == 1.4) %>%
subset(select = !is.na(str_match(colnames(iris), "Len")))
# Sepal.Length Petal.Length
# 1 5.1 1.4
# 2 4.9 1.4
# 5 5.0 1.4
# 9 4.4 1.4
# 29 5.2 1.4
# 34 5.5 1.4
# 48 4.6 1.4
# 50 5.0 1.4
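As a rough sketch of that "wrap it in a function" idea, simplified to take a ready-made logical vector rather than subset()-style arguments (subset_cells is a made-up name; it assumes the annotation's id column matches the column names of the expression matrix, as in your code above):
# keep only the cells whose annotation rows satisfy `keep`
subset_cells <- function(ann, dat, keep) {
  ids <- ann$id[keep]
  dat[, colnames(dat) %in% ids, drop = FALSE]
}
EC_CD69pos_cells <- subset_cells(
  annotation_index_LN, manual_normalized,
  annotation_index_LN$conditions == "EC_LN" & annotation_index_LN$CD69 == 100
)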

Split data in R and perform operation

I have a very large file that simply contains wave heights for different tidal scenarios at different locations. My file is organized into 13 wave heights x 9941 events, for 5153 locations.
What I want to do is read in this very long data file, which looks like this:
0.0
0.1
0.2
0.4
1.2
1.5
2.1
.....
Then split it into segments of length 129,233 (corresponding to 13 tidal scenarios for 9941 events at a specific location). On each subset of the data I'd like to perform some statistical functions to calculate exceedance probability, among other things. I will then join it to the file containing location information, and print some output files.
My code so far is not working, although I've tried many things. It seems to read the data just fine, however it is having trouble with the split. I suspect it may have something to do with the format of the input data from the file.
# read files with return period wave heights at defense points
#Read wave heights for 13 tides per 9941 events, for 5143 points
WaveRP.file <- paste('waveheight_test.out')
WaveRPtable <- read.csv(WaveRP.file, head=FALSE)
WaveRP <- c(WaveRPtable)
#colnames(WaveRP) <- c("WaveHeight")
print(paste(WaveRP))
#Read X,Y information for defense points
DefPT.file <- paste('DefXYevery10thpt.out')
DefPT <- read.table(DefPT.file, head=FALSE)
colnames(DefPT) <- c("X_UTM", "Y_UTM")
#Split wave height data frame by defense point
WaveByDefPt <- split(WaveRP, 129233)
print(paste(length(WaveByDefPt[[1]])))
for (i in 1:length(WaveByDefPt)/129233){
print(paste("i",i))
}
I have also tried
#Split wave height data frame by defense point
WaveByDefPt <- split(WaveRP, ceiling(seq_along(WaveRP)/129233))
No matter how I seem to perform the split, I am simply getting the original data as one long subset. Any help would be appreciated!
Thanks :)
Kimberly
Try cut to build groups:
v <- as.numeric(readLines(n = 7))
0.0
0.1
0.2
0.4
1.2
1.5
2.1
groups <- cut(v, breaks = 3) # you want breaks = 129233
aggregate(x = v, by = list(groups), FUN = mean) # e.g. means per group
# Group.1 x
# 1 (-0.0021,0.699] 0.175
# 2 (0.699,1.4] 1.200
# 3 (1.4,2.1] 1.800
You are kind of shuffling the data into various data types here.
When the file is originally read, it is a dataframe with 1 column (V1). Then you pass it to c(), which results in a list with a single vector in it. This means if you try and do anything to WaveRP you will probably fail because that's the name of the list. The numeric vector is WaveRP[[1]].
Instead, just extract the numeric vector using the $ operator and then you can work with it. Or just work with it inside the data frame. The fun part will be thinking of a way to create the grouping vector. I'll give an example.
Something like this:
WaveRP.file <- paste('waveheight_test.out')
WaveRPtable <- read.csv(WaveRP.file, head=FALSE)
WaveRPtable$group <- ceiling(seq_along(WaveRPtable$V1)/129233)
SplitWave <- split(WaveRPtable, WaveRPtable$group)
Now you will have a list of data frames, one per group of 129,233 rows (i.e. one per location). Look at each one using double-bracket indexing, e.g. SplitWave[[2]] for the second group. You can then merge the location information file with these data frames individually.
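From there, computing a per-location statistic and attaching the coordinates could look roughly like this (a sketch only: the 2.0 threshold is made up, and it assumes the groups come out in the same order as the rows of DefPT):
threshold <- 2.0   # hypothetical wave-height threshold
per_point <- data.frame(
  exceed_prob = sapply(SplitWave, function(d) mean(d$V1 > threshold)),
  max_height  = sapply(SplitWave, function(d) max(d$V1))
)
out <- cbind(DefPT, per_point)   # one summary row per defense point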

Create a new (identical) data frame by sampling an existing data frame column-wise

I am trying to create a new data frame that is identical in the number of columns (but not rows) to an existing data frame. All columns are of identical type, numeric. I need to sample each column of the original data frame (n=241 samples, replace=T) and add those samples to the new data frame in the same column position as in the original data frame.
My code so far:
#create the new data frame
tree.df <- data.frame(matrix(nrow=0, ncol=72))
#give same column names as original data frame (data3)
colnames(tree.df)<-colnames(data3)
#populate with NA values
tree.df[1:241,]=NA
#sample original data frame column wise and add to new data frame
for (i in colnames(data3)){
rbind(sample(data3[i], 241, replace = T),tree.df)}
The code isn't working out. Any ideas on how to get this to work?
Use the fact that a data frame is a list, and pass to lapply to perform a column-by-column operation.
Here's an example, taking 5 elements from each column in iris:
as.data.frame(lapply(iris, sample, size=5, replace=TRUE))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.7 3.2 1.7 0.2 versicolor
## 2 5.8 3.1 1.5 1.2 setosa
## 3 6.0 3.8 4.9 1.9 virginica
## 4 4.4 2.5 5.3 0.2 versicolor
## 5 5.1 3.1 3.3 0.3 setosa
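Applied to your own objects, that one-liner becomes (assuming data3 holds the 72 numeric columns you describe):
tree.df <- as.data.frame(lapply(data3, sample, size = 241, replace = TRUE))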
There are several issues here. Probably the one causing things not to work is the way you are trying to access a column of the data frame data3. To do that, use data3[, i]; note the comma, which separates the row index from the column index.
Additionally, since you already know how big your data frame will be, allocate the space from the beginning:
tree.df <- data.frame(matrix(nrow = 241, ncol = 72))
tree.df is already prepopulated with missing (NA) values so you don't need to do it again. You can now rewrite your for loop as
for (i in colnames(data3)){
tree.df[, i] <- sample(data3[, i], 241, replace = TRUE)
}
Notice I spelled out TRUE. This is better practice than using T because T can be reassigned. Compare:
T             # [1] TRUE   (T is just a variable that defaults to TRUE)
T <- FALSE    # allowed: T can be reassigned
T             # [1] FALSE
TRUE <- FALSE # error: TRUE is reserved and cannot be reassigned

Split Data into groups of equal means

I'm looking for a way to split a data frame into groups of equal size (essentially the same number of rows in each group) whose means are nearly equal.
User Data
1 5.0
2 4.5
3 3.5
4 6.0
5 7.0
6 6.5
7 5.5
8 6.2
9 5.7
10 5.9
This is very similar to this request; however, that approach only splits the data into 2 groups.
My actual dataset contains anywhere from 75-150 rows, and I need to split it into anywhere from 5-10 groups of equal mean and fairly equal size.
I've researched on Google & Stack Exchange for the last few days, and I'm just not having much luck. Any guidance would be great.
Thanks in advance!
More details:
Maybe I need to provide some more details, below I've included a real dataset. We are a transportation company, this data set has Driver ID, Miles, Gallons provided. What I have been doing is reading the data into R, and adding and MPG column like so:
data <- read.csv('filename')
data$MPG <- data$Miles / data$Gallons
Then I tried the two provided answers below. Arun's idea gives me almost equal group sizes (9 members per group, 10 groups), however the variation of the means is large, from 6.615 to 7.093, which is too large a variation for me to start off with. Thomas' idea gets a little bit tighter variation, but the group sizes are all different, from 6 to 13 members.
What we are looking to do is improve fleet MPG, and we're going to accomplish this with a team based competition, so I need to randomly put the teams together with them all starting from relatively the same group MPG.
Maybe that helps and can lead us in the correct direction? I tried doing this just in my programming language, but it locks the computer up every time, so I figured that R would probably be able to process the data better.
Thanks again!
If similar means is really all that matters, I've put together a simulation below that basically looks at a bunch of different combinations of the data (n) for a particular group size (k) and then minimizes the variance of the group means. With that minimization you can then extract that grouping from the simulation results.
df <- data.frame(User=1:1000,Data=rnorm(1000,0,1)) # example data
myfun <- function() {
  k <- 5 # number of groups
  tmp <- seq(nrow(df)) %% k # group labels 0..(k-1); adapted from @qwwqwwq's answer
  thisgroup <- sample(tmp, nrow(df), FALSE) # shuffle the labels across rows
  # thisgroup <- sample(1:k, nrow(df), TRUE) # original version (unequal group sizes)
  thisavg <- as.vector(by(df$Data, thisgroup, mean)) # group means
  thisvar <- var(thisavg) # variance of the group means
  return(list(group=thisgroup, avgs=thisavg, var=thisvar))
}
n <- 1000 # number of simulations
sorts <- replicate(n, myfun(), simplify=FALSE)
wh <- which.min(sapply(sorts, function(x) x$var)) # minimization
# sorts[[wh]] # this is the sample you want
split(df, sorts[[wh]]$group) # list of separate dataframes for each group
You could also let the groups have different sizes, if you don't care how many cases end up in each group, by making k a random draw (inside the function) from the range of group counts you're willing to have.
There are probably other ways to do this, though.
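To see how tight the winning draw actually is, you can inspect the group means it stored:
sorts[[wh]]$avgs               # the k group means for the best (lowest-variance) draw
diff(range(sorts[[wh]]$avgs))  # spread between the largest and smallest group mean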
Going by Thomas' idea, here's a brute-force/greedy approach that will give more or less the same values (you can increase the number of repetitions until you are satisfied with how close the solution is).
# Assuming the data you provided is in `df`
grp <- 5
myfun <- function() {
samp <- sample(nrow(df))
s.mean <- tapply(df$Data, samp %% grp, mean)
s.var <- var(s.mean)
list(samp, s.mean, s.var)
}
out <- replicate(1000, myfun(), simplify=FALSE)
min.pos <- which.min(sapply(out, `[[`, 3))
min.idx <- out[[min.pos]][[1]]
split(df$Data[min.idx], min.idx %% grp)
$`0`
[1] 7.0 5.9
$`1`
[1] 5.0 6.5
$`2`
[1] 5.5 4.5
$`3`
[1] 6.2 3.5
$`4`
[1] 5.7 6.0
This is what out[min.pos] looks like:
out[min.pos]
[[1]]
[[1]][[1]]
[1] 7 9 8 5 3 4 1 2 10 6
[[1]][[2]]
0 1 2 3 4
5.85 5.70 5.60 5.25 5.50
[[1]][[3]]
[1] 0.05075
Simplest way I can think of: sort the data, take all the indices modulo the number of groups, and you're done. Should work well if the data are normally distributed, I think. Has the advantage of the groups being as equally sized as possible.
mpg <- rnorm(150)
mpg <- sort(mpg)
ngroups = 13
df = data.frame( mpg=mpg, group=seq(length(mpg))%%ngroups)
tapply(df$mpg, df$group, mean)
0 1 2 3 4 5 6 7 8
0.080400272 -0.110797283 -0.046698548 -0.014177675 0.024410834 0.048370962 0.066265303 0.087119914 -0.062259638
9 10 11 12
-0.042172496 -0.003451581 0.033853024 0.056947458

using ffdfdply to split data and get characteristics of each id in the split

Within R I'm using ffdf to work with a large dataset. I want to use ffdfdply from the ffbase package to split the data according to a certain variable (var) and then compute some characteristics for all the observations with a unique value for var (for example: the number of observations for each unique value of var). To see if this is possible using ffdfdply I executed the example described below.
I expected that it would split on each Species, calculate the minimum Petal.Width for each Species, and return two columns, each with three entries, listing the Species and the minimum Petal.Width for that Species. Expected output:
Species min_pw
1 setosa 0.1
2 versicolor 1.0
3 virginica 1.4
However for BATCHBYTES=5000 it will use two splits, one containing two Species and the other containing one Species. This results in the following:
Species min_pw
1 setosa 0.1
2 virginica 1.4
When I change BATCHBYTES to 2000, this forces ffdfdply to use three splits and results in the expected output posted above. However, I want a way of enforcing one split per unique value of the variable assigned to 'split'. Is there any way to make this happen? Or do you have any other suggestions to get the result I need?
library(ffbase) # also loads ff; provides ffdfdply
ffiris <- as.ffdf(iris)
result <- ffdfdply(x = ffiris,
split = ffiris$Species,
FUN = function(x) {
min_pw <- min(x$Petal.Width)
data.frame(Species=x$Species, min_pw= min_pw)
},
BATCHBYTES = 5000,
trace=TRUE
)
dim(result)
dim(iris)
result
The function ffdfdply was designed for the case where you have a lot of split elements, e.g. 1,000,000 customers, and you want the in-memory chunks to be split at least by customer, but possibly covering several customers at once if your RAM allows, so that the internals do not need to do an ffwhich 1,000,000 times.
That is why the documentation of ffdfdply states:
'Please make sure your FUN covers the fact that several split elements can be in one chunk of data on which FUN is applied.'
So the solution to your issue is to cover this inside FUN, for example:
FUN=function(x){
require(doBy)
summaryBy(Petal.Width ~ Species, data=x, keep.names=TRUE, FUN=min)
}
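If you would rather not add a dependency on doBy, an equivalent FUN using base R's aggregate should also work (an untested sketch; like summaryBy, aggregate returns a data.frame, which is what ffdfdply expects from FUN):
FUN = function(x) {
  res <- aggregate(Petal.Width ~ Species, data = x, FUN = min)
  names(res)[2] <- "min_pw"
  res
}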
