Best way to subset RNAseq dataset for comparison in R

I have a single-cell RNAseq dataset that I have been analyzing in R: a data frame with 205 columns and 15,000 rows, where each column is a cell and each row is a gene.
I have an annotation matrix that has the identity of each cell. For example, patient ID, disease status, etc...
I want to do different comparisons based on the grouping info provided by the annotation matrix.
I know that in Python you can create a dictionary keyed on the cell IDs.
What is an efficient way in R to subset the same dataset in different ways?
So far what I have been doing is:
EC_index <- subset(annotation_index_LN, conditions == "EC_LN")
CP_index <- subset(annotation_index_LN, conditions == "CP_LN")
CD69pos <- subset(annotation_index_LN, CD69 == 100)
EC_CD69pos <- subset(EC_index, CD69 == 100)
EC_CD69pos <- subset(EC_CD69pos, id %in% colnames(manual_normalized))
CP_CD69pos <- subset(CP_index, CD69 == 100)
CP_CD69pos <- subset(CP_CD69pos, id %in% colnames(manual_normalized))

This probably won't entirely answer your question, but even before you begin trying to subset your data, you might want to think about converting it into a SummarizedExperiment. This is a type of object that can hold annotation data for features and samples and will keep everything properly referenced if you decide to subset samples, remove rows, etc. This type of object is commonly used by packages hosted on Bioconductor. They have loads of tutorials on various genomics pipelines, and I'm sure you can find more detailed information there.
http://bioconductor.org/help/course-materials/
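Here is a minimal sketch (not from the original answer) of what that could look like with the objects named in the question, assuming the annotation table has an id column matching the column names of the expression matrix:
library(SummarizedExperiment)
# manual_normalized: genes x cells matrix; annotation_index_LN: one row per cell
mat <- as.matrix(manual_normalized)
ann <- annotation_index_LN[match(colnames(mat), annotation_index_LN$id), ]
se  <- SummarizedExperiment(assays = list(counts = mat), colData = ann)
# Subsetting columns keeps expression values and annotations in sync:
ec_cd69pos <- se[, se$conditions == "EC_LN" & se$CD69 == 100]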

The following uses the iris data, since you haven't given a minimal example of your data.
You need an R package that provides the %>% pipe: the magrittr package, which is also re-exported by dplyr.
If you have to do a lot of subsetting, wrap this pattern in a function whose arguments you pass on to subset (a sketch follows the example output below).
library(magrittr)  # provides %>%
library(stringr)   # provides str_match
iris %>%
  subset(Species == "setosa" & Petal.Width == 0.2 & Petal.Length == 1.4) %>%
  subset(select = !is.na(str_match(colnames(iris), "Len")))
# Sepal.Length Petal.Length
# 1 5.1 1.4
# 2 4.9 1.4
# 5 5.0 1.4
# 9 4.4 1.4
# 29 5.2 1.4
# 34 5.5 1.4
# 48 4.6 1.4
# 50 5.0 1.4
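A sketch of the function idea mentioned above (subset_cells is a hypothetical name), assuming you pass a precomputed logical vector for the rows and a regular expression for the columns:
subset_cells <- function(df, rows, col_pattern) {
  df[rows, grepl(col_pattern, colnames(df)), drop = FALSE]
}

subset_cells(iris,
             iris$Species == "setosa" & iris$Petal.Width == 0.2,
             "Len")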

Related

R: How to subset with filtering multiple (date-)variables at once

I have a dataset with multiple date variables and want to create subsets by filtering rows on the values of those date variables.
To be more precise: each row in the dataset represents a patient case in a psychiatric unit and contains all the applied seclusions. For each case there is either no seclusion, or they are documented as seclusion_date1, seclusion_date2, ..., seclusion_enddate1, seclusion_enddate2, ... (depending on how many seclusions happened).
My plan is to create a subset with only those cases where either no seclusion is documented, or seclusion_date1 (the first seclusion) is after 2019-06-30 and all the possible seclusion_enddates (1, 2, 3, ...) are before 2020-05-01. Cases with seclusions starting before 2019-06-30 or ending after 2020-05-01 would be excluded.
I'm very new to the R language, so my attempts may well be wrong. I appreciate any help or ideas.
I tried it with the subset function in R.
To filter all possible seclusion_enddates at once, I tried to use starts_with and I tried writing a loop.
all_seclusion_enddates <- function() {
  c(WMdata, any_of(c("seclusion_enddate")), starts_with("seclusion_enddate"))
}
Error: `any_of()` must be used within a selecting function.
and then my plan would have been: cohort_2_before <- subset(WMdata, seclusion_date1 >= "2019-07-01" & all_seclusion_enddates <= "2020-04-30")
loop:
for (i in 1:53) {
  cohort_2_before <- subset(WMdata,
    seclusion_date1 >= "2019-07-01" &
    (paste0("seclusion_enddate", i)) <= "2020-04-30" &
    restraint_date1 >= "2019-07-01" &
    (paste0("seclusion_enddate", i)) <= "2020-04-30")
}
Result: A subset with 0 obs. was created.
Since you don't provide a reproducible example, I can't see your specific problem, but I can help with the core issue.
any_of, starts_with and the like are column-selection helpers used by the tidyverse set of packages. They can only be used inside tidyverse selecting functions to control their behavior, which is why you got that error. They probably are the tools I'd use to solve this problem, though, so here's how you can use them:
Starting with the built-in dataset iris, we use the filter_at function from dplyr (enter ?filter_at in the R console to read the help). This function filters (selects specific rows of) a data.frame (given to the .tbl argument) based on a criterion (given to the .vars_predicate argument), which is applied to the columns selected by the .vars argument.
library(dplyr)
iris %>%
  filter_at(vars(starts_with('Sepal')), all_vars(. > 4))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 4.4 1.5 0.4 setosa
2 5.2 4.1 1.5 0.1 setosa
3 5.5 4.2 1.4 0.2 setosa
In this example, we take the dataframe iris, pass it into filter_at with the %>% pipe command, then tell it to look only in columns which start with 'Sepal', then tell it to select rows where all the selected columns match the given condition: value > 4. If we wanted rows where any column matched the condition, we could use any_vars(.>4).
You can add multiple conditions by piping it into other filter functions:
iris %>%
  filter_at(vars(starts_with('Sepal')), all_vars(. > 4)) %>%
  filter(Petal.Width > 0.3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 4.4 1.5 0.4 setosa
Here we filter the previous result again to get rows that also have Petal.Width > 0.3
In your case, you'd want to make sure your date values are stored as Date (with as.Date), then filter on seclusion_date1 and on vars(starts_with('seclusion_enddate')). A sketch follows.
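A minimal sketch under assumptions: WMdata has Date-classed columns seclusion_date1, seclusion_enddate1, seclusion_enddate2, ..., and cases without a seclusion have NA in those columns:
library(dplyr)

cohort_2_before <- WMdata %>%
  filter(is.na(seclusion_date1) | seclusion_date1 >= as.Date("2019-07-01")) %>%
  filter_at(vars(starts_with("seclusion_enddate")),
            all_vars(is.na(.) | . <= as.Date("2020-04-30")))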

Using distinct() with a vector of column names

I have a question about using distinct() from dplyr on a tibble/data.frame. From the documentation it is clear that you can use it by explicitly naming the columns. I have a data frame with more than 100 columns and want to use the function on just a subset of them. My intuition said to put the column names in a vector and pass it as an argument to distinct. But distinct uses only the first vector element.
Example on iris:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct_(iris, exclude.columns)
This is different from
exclude.columns <- c('Sepal.Width', 'Species')
distinct_(iris, exclude.columns)
I think distinct is not made for this operation. Another option would be to subset the data.frame, use distinct, and then join the excluded columns back. But my question is whether there is an option using just one function?
As suggested in my comment, you could also try:
data(iris)
library(dplyr)
exclude.columns <- c('Species', 'Sepal.Width')
distinct(iris, !!! syms(exclude.columns))
Output (first 10 rows):
Sepal.Width Species
1 3.5 setosa
2 3.0 setosa
3 3.2 setosa
4 3.1 setosa
5 3.6 setosa
6 3.9 setosa
7 3.4 setosa
8 2.9 setosa
9 3.7 setosa
10 4.0 setosa
However, that was suggested more than two years ago. A more idiomatic approach with current dplyr would be:
distinct(iris, across(all_of(exclude.columns)))
It is not entirely clear to me whether you would like to keep only the exclude.columns or actually exclude them; if the latter, you just put a minus in front, i.e. distinct(iris, across(-all_of(exclude.columns))).
Your objective sounds unclear. Are you trying to get all distinct rows across all columns except $Species and $Sepal.Width? If so, that doesn't quite make sense.
Say two rows are identical in every variable except $Sepal.Width. Using distinct() in the way you described would throw out the second row because it was not distinct from the first, except that it was, in the very column you ignored.
You need to rethink your objective and whether it makes sense.
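A tiny demonstration of that caveat (hypothetical two-row data; .keep_all = TRUE is needed to retain the ignored column in the output):
library(dplyr)
df <- tibble(a = c(1, 1), b = c("x", "y"))
distinct(df, a, .keep_all = TRUE)
# # A tibble: 1 x 2
# #       a b
# # 1     1 x     <- the row with b == "y" is silently dropped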
If you are just worried about duplicate rows, then
data %>%
  distinct(across(everything()))
will do the trick.

External functions inside filter of dplyr in R

How does an external function inside dplyr::filter know the columns just by their names, without a reference to the data.frame they come from?
For example consider the following code:
filter(hflights, Cancelled == 1, !is.na(DepDelay))
How does is.na know that DepDelay is from hflights? There could be a DepDelay vector defined elsewhere in my code. (Assume hflights has columns named 'Cancelled' and 'DepDelay'.)
In Python we would have to use the column name along with the name of the data frame, so here I was expecting something like
!is.na(hflights$DepDelay)
Any help would be really appreciated.
While I'm not expert enough to give a precise answer, hopefully I won't lead you too far astray.
It is essentially a question of environments. filter() first looks for a vector object of that name inside the data frame given as its first argument. If it doesn't find it there, it goes "up a level", so to speak, to the calling environment and looks for any other vector object of that name. Consider:
library(dplyr)
Species <- iris$Species
iris2 <- select(iris, -Species) # Remove the Species variable from the data frame.
filter(iris2, Species == "setosa")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.1 3.5 1.4 0.2
#> 2 4.9 3.0 1.4 0.2
#> 3 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 5.0 3.6 1.4 0.2
More information on the topic can be found here (warning, the book is a work in progress).
Most functions from the dplyr and tidyr packages are specifically designed to handle data frames, and all of those functions take the data frame as their first argument. This enables the pipe (%>%), which lets you build a more intuitive workflow. Think of the pipe as the equivalent of saying "... and then ...". In the context shown above, you could do:
iris %>%
  select(-Species) %>%
  filter(Species == "setosa")
And you get the same output as above. Combining the pipe with scoping variable lookup to the referenced data frames is meant to produce code that is more readable for humans, which is one of the guiding principles of the tidyverse set of packages, of which both dplyr and tidyr are components.
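As an aside (not part of the original answer), dplyr also exposes rlang's .data and .env pronouns, which make the lookup explicit when a column and an outside variable share a name:
library(dplyr)

Species <- "versicolor"  # a global variable with the same name as the column

iris %>%
  filter(.data$Species == .env$Species) %>%  # column on the left, global on the right
  head(2)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 1          7.0         3.2          4.7         1.4 versicolor
#> 2          6.4         3.2          4.5         1.3 versicolor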

Converting Data into two separate groups

I have simulated data for two groups drawn from a multivariate normal distribution in R, as per below:
#Package to generate a multivariate normal distribution
library(mvtnorm)
#The number of simulated variables that can be changed
p=5
set.seed(30)
#Generating the eigenvalues from a uniform distribution.
m=p
eigval <- runif(m,0.25,1)
#Generating a positive symmetric matrix (this will be used as the covariance matrix for the data generation).
#Ravi Varadhan (2008)
shat <- matrix(ncol=m, rnorm(m^2))
decomp <- qr(shat)
Q <- qr.Q(decomp)
R <- qr.R(decomp)
d <- diag(R)
ph <- d/abs(d)
O <- Q%*%diag(ph)
shat <- t(O)%*%diag(eigval)%*%(O)
#Variance-covariance matrix for the data generation.
sig <- shat
#Mean vectors for two groups where the parameters may be changed accordingly.
m1 <- runif(p,0.1,0.2)
m2 <- runif(p,0.4,0.9)
#Euclidean distance between two groups
dist(rbind(m1,m2), method = "euclidean")
#The number of observations from group1
n1 <- 30
#The number of observation from group2
n2 <- 70
#The total number of observations
n <- n1+n2
#Group Identifier where '1' represent group 1 and '2' represent group 2
G1 <- rep(1,n1)
G2 <- rep(2,n2)
G <- c(G1,G2)
#Generate Data from group
library(mvtnorm)
g1 <- rmvnorm(n=n1, mean=m1, sigma=sig)
g2 <- rmvnorm(n=n2, mean=m2, sigma=sig)
g <-rbind(g1,g2)
Data <- data.frame(G, DV1=g[ , 1], DV2=g[ , 2], DV3=g[ ,3], DV4=g[,4], DV5=g[ ,5])
Now I want to apply the qda function to this simulated data, using the example code from the documentation:
https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/qda.html
However, in that example the built-in iris data has been arranged as a 3-dimensional array of size 50 by 4 by 3 (iris3), as represented in S-PLUS (see https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html).
Can someone tell me how any data can be split into an n x m x p array?
I'm not certain if you want an answer about your code or about the iris3 question. I'll talk about the latter for a moment.
The fact that it is a tidy array with 3 dimensions is a convenience for demonstration. It works because Edgar Anderson harvested exactly 50 samples of each species. Nothing in the immediate documentation suggests there is a relevant pairing between the first setosa and the first virginica, so the data are not paired. Unfortunately, arranging the species as planes in the cube suggests exactly such a paired relationship.
Consider this: had Edgar instead sampled 51 setosa but kept the other two species at 50, how would the array look? One of the planes would be a little taller than the other two, so the result could not be a rectangular array. What if he had sampled the 50 setosa in a different order (nothing states that order matters)? The array would be different, and an analysis that looks at the 3rd margin (iris3[1,1,]) would return different results, even though the actual data didn't change.
So I believe the perfectly-arranged 3-D array exists for the purpose of demonstrating multi-dimensional data handling, not because the data actually belong in that orientation.
EDIT
Given that you want to know how to transform (any) data from a 2-D frame to a 3-D array, here's an example using iris. This makes a couple of assumptions:
1. All of the data is of the same class. In the example below I remove the $Species column; because an array requires everything to be of one class, if I did not remove it then all of the numbers would be converted to character, probably not what you want.
2. The pairing within the added dimension is actually relevant, as discussed above. The process works just fine if the data is not paired, but with other data there may well be different counts in the different categories.
3. Similar (and tied) to #2, all categories must have the same number of rows. This can be waved away if you are willing to pad the shorter categories with rows of NA, but that seems a bit sloppy to me.
Base R
First, we split the 2-D data into groups, conveniently (and, for this to work, necessarily) resulting in elements with the same dimensions (50 x 4). The -5 removes the fifth column, $Species, so that our next step using as.matrix will not convert the numbers to character.
irislist <- by(iris, iris$Species, `[`, -5)
Pre-populate a 3D array per the dimensions of the source data.
mtx <- array(NA, dim = c(dim(irislist[[1]]), length(irislist)))
This might be done with one of the *apply functions, but I couldn't get it to work generically. Perhaps somebody can comment with a suggestion.
for (i in seq_along(irislist)) mtx[,,i] <- as.matrix(irislist[[i]])
The 3-D array is made! It might be nice to add dimension names, though they are not strictly required:
dimnames(mtx) <- list(NULL, colnames(irislist[[1]]), names(irislist))
mtx
# , , setosa
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 5.1 3.5 1.4 0.2
# [2,] 4.9 3.0 1.4 0.2
# [3,] 4.7 3.2 1.3 0.2
# [4,] 4.6 3.1 1.5 0.2
# [5,] 5.0 3.6 1.4 0.2
# ...snip...
abind
This can also be done with the abind package, without the need to pre-allocate mtx, run a for loop, or name the dimensions:
library(abind)
mtx2 <- do.call("abind", c(irislist, list(along = 3)))
str(mtx2)
# num [1:50, 1:4, 1:3] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# - attr(*, "dimnames")=List of 3
# ..$ : NULL
# ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
# ..$ : chr [1:3] "setosa" "versicolor" "virginica"
Wrap-Up
It isn't obvious how this would work with your data. When I ran your code, I ended up with six columns, only one of which (Data$G) looks like something you could split into another dimension (i.e., it looks categorical). Unfortunately:
table(Data$G)
# 1 2
# 30 70
and per assumption #3 above, this doesn't work: the two groups have different sizes.
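For what it's worth, here is a hedged sketch of the original goal: MASS::qda accepts an ordinary 2-D data frame through its formula interface, so the 3-D reshaping above isn't actually required to fit the model.
library(MASS)

# Fit QDA on the simulated data; G is coerced to a factor for grouping.
fit  <- qda(factor(G) ~ DV1 + DV2 + DV3 + DV4 + DV5, data = Data)
pred <- predict(fit, Data)
table(observed = Data$G, predicted = pred$class)  # training-set confusion matrix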

Create a new (identical) data frame by sampling an existing data frame column-wise

I am trying to create a new data frame with the same number of columns (but not rows) as an existing data frame. All columns are of the same type, numeric. I need to sample each column of the original data frame (n = 241 samples, replace = TRUE) and place those samples in the new data frame at the same column position as in the original.
My code so far:
#create the new data frame
tree.df <- data.frame(matrix(nrow=0, ncol=72))
#give same column names as original data frame (data3)
colnames(tree.df)<-colnames(data3)
#populate with NA values
tree.df[1:241,]=NA
#sample original data frame column wise and add to new data frame
for (i in colnames(data3)){
  rbind(sample(data3[i], 241, replace = T), tree.df)
}
The code isn't working out. Any ideas on how to get this to work?
Use the fact that a data frame is a list and pass it to lapply to perform a column-by-column operation.
Here's an example, taking 5 elements from each column in iris:
as.data.frame(lapply(iris, sample, size=5, replace=TRUE))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.7 3.2 1.7 0.2 versicolor
## 2 5.8 3.1 1.5 1.2 setosa
## 3 6.0 3.8 4.9 1.9 virginica
## 4 4.4 2.5 5.3 0.2 versicolor
## 5 5.1 3.1 3.3 0.3 setosa
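Applied to the objects in the question (assuming data3 is all-numeric as described), the whole task collapses to one line:
tree.df <- as.data.frame(lapply(data3, sample, size = 241, replace = TRUE))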
There are several issues here. Probably the one causing things not to work is how you access a column of the data frame data3: use data3[, i]. Note the comma; it separates the row index from the column index. Also, the result of rbind() in your loop is never assigned to anything, so it is discarded.
Additionally, since you already know how big your data frame will be, allocate the space from the beginning:
tree.df <- data.frame(matrix(nrow = 241, ncol = 72))
tree.df is already prepopulated with missing (NA) values, so you don't need to do that again. You can now rewrite your for loop as
for (i in colnames(data3)){
tree.df[, i] <- sample(data3[, i], 241, replace = TRUE)
}
Notice I spelled out TRUE. This is better practice than using T, because T can be reassigned. Compare:
T             # TRUE
T <- FALSE    # allowed: T is an ordinary variable
T             # now FALSE
TRUE <- FALSE # error: TRUE is a reserved word and cannot be reassigned
