I am analyzing six single-cell RNA-seq datasets with the Seurat package.
Each of the 6 datasets came from a separate 10X run, and I combined them with batch-effect correction via the Seurat function "FindIntegrationAnchors".
Among the 6 datasets, data 1, 2, 3 and 4 are the "untreated" group, while data 5 and 6 belong to the "treated" group.
I merged all 6 datasets with batch correction, but I also need to compare features of "untreated" vs "treated".
How can I group data 1, 2, 3 and 4 into an "untreated" group and data 5 and 6 into a "treated" group, and then perform downstream analysis?
Thanks.
One quick and dirty way to do this is to add the information before merging the Seurat objects:
...
so_samples[[1]]@meta.data$treatment <- "control"
so_samples[[2]]@meta.data$treatment <- "control"
so_samples[[3]]@meta.data$treatment <- "control"
so_samples[[4]]@meta.data$treatment <- "control"
so_samples[[5]]@meta.data$treatment <- "treated"
so_samples[[6]]@meta.data$treatment <- "treated"
...
anchors <- FindIntegrationAnchors(object.list = so_samples, dims = 1:20)
so_all_samples <- IntegrateData(anchorset = anchors, dims = 1:20)
In general, it would be better to load such metadata from a file and join it to the Seurat object rather than rely on error-prone copy-paste code. Also note that it is generally a bad idea to modify R S4 objects (those whose elements you access with @) directly like this, but the functions that the Seurat package provides for modifying Seurat objects are so cumbersome to use that I doubt they will ever change the underlying data structure.
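As a minimal sketch of the "load it from a file" idea, the snippet below looks up each sample's treatment in a small sample sheet instead of repeating one assignment per sample. The sample names, the sample_sheet table and the cell_meta list are hypothetical stand-ins (plain data frames in place of so_samples[[i]]@meta.data), so that the example runs without Seurat.

```r
# Hypothetical sample sheet; in practice this might be read with read.csv()
sample_sheet <- data.frame(
  sample    = paste0("data", 1:6),
  treatment = c(rep("control", 4), rep("treated", 2)),
  stringsAsFactors = FALSE
)

# Stand-ins for so_samples[[i]]@meta.data: one small per-cell table per sample
cell_meta <- lapply(sample_sheet$sample, function(s) {
  data.frame(cell = paste0(s, "_cell", 1:3), stringsAsFactors = FALSE)
})
names(cell_meta) <- sample_sheet$sample

# Named vector used as a lookup table: sample name -> treatment
treatment_by_sample <- setNames(sample_sheet$treatment, sample_sheet$sample)

# One loop replaces the six copy-pasted assignments
for (s in names(cell_meta)) {
  cell_meta[[s]]$treatment <- treatment_by_sample[[s]]
}

# Combined per-cell table, analogous to the metadata after integration
all_meta <- do.call(rbind, cell_meta)
table(all_meta$treatment)
```

With real Seurat objects, and assuming so_samples is a named list, the line inside the loop would instead set the treatment column on so_samples[[s]] before calling FindIntegrationAnchors.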
Related
How do I extract the "VarCompContrib" column in the data frame produced using the gageRR function in R?
This is for a GageRR analysis of a measurement system. I'm trying to make a very user friendly program where other people can just enter the information required, like number of operators, parts, and measurements, as well as the measurements themselves, and output the correct analysis. I'm gonna use an if-statement later on to do the "analysis" portion, but I am having trouble actually managing the data frame produced with gageRR.
library(MASS)
library(Rsolnp)
library(qualityTools)
design = gageRRDesign(Operators=3, Parts=10, Measurements=2, randomize=FALSE)
response(design) = c(23,22,22,22,22,25,23,22,23,22,20,22,22,22,24,25,27,28,
23,24,23,24,24,22,22,22,24,23,22,24,20,20,25,24,22,24,21,20,21,22,21,22,21,
21,24,27,25,27,23,22,25,23,23,22,22,23,25,21,24,23)
gdo=gageRR(design)
plot(gdo)
I am looking to get a column vector of the 7 numbers under VarCompContrib.
For starters, you can look at the structure of gdo with str(gdo). From there, we see that Varcomp is a slot, so we can access it with gdo@Varcomp and just convert it to a data.frame:
library(qualityTools)
design <- gageRRDesign(Operators = 3, Parts = 10, Measurements = 2, randomize = FALSE)
response(design) <- c(
23,22,22,22,22,25,23,22,23,22,20,22,22,22,24,25,27,28,23,24,23,24,24,22,22,22,24,23,22,24,
20,20,25,24,22,24,21,20,21,22,21,22,21,21,24,27,25,27,23,22,25,23,23,22,22,23,25,21,24,23
)
gdo <- gageRR(design)
data.frame(gdo@Varcomp)
#   totalRR repeatability reproducibility         a a_b     bTob totalVar
# 1 1.66441      1.209028       0.4553819 0.4553819   0 1.781211 3.445621
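To get from that slot to a plain numeric vector of length 7, something like the sketch below may help. It uses a toy S4 class (ToyGage, hypothetical) holding the values printed above, so it runs without qualityTools. It also assumes that VarCompContrib is each variance component divided by totalVar, which is worth checking against the table that gageRR prints.

```r
library(methods)

# Toy S4 class standing in for the object returned by gageRR()
setClass("ToyGage", representation(Varcomp = "list"))

toy <- new("ToyGage", Varcomp = list(
  totalRR = 1.66441, repeatability = 1.209028, reproducibility = 0.4553819,
  a = 0.4553819, a_b = 0, bTob = 1.781211, totalVar = 3.445621
))

# Slots are read with @; unlist() flattens the list into a named numeric vector
var_comp <- unlist(toy@Varcomp)

# Assumed definition: contribution = component / total variance
var_comp_contrib <- var_comp / var_comp[["totalVar"]]
var_comp_contrib
```

On the real object, the same two steps would start from unlist(gdo@Varcomp).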
I have about 15 different Data sets in R that I need to merge into 1 big Data set.
Combining them will create a data set of about 1120 variables and about 1500 observations.
There is no problem merging the first 5 data sets (getting to about 700 variables), but when trying to merge the 6th/7th dataset, R either gets stuck or gives this error message:
Error: cannot allocate vector of size 10.7 Mb
I have tried different ways to write this code (functions/loops), but this is the simplest way, by which I understood that it gets stuck on the 6th dataset:
#Merging the first two data sets
#bindedDataNames is a chr vector with the names of all the datasets that need
#to be merged.
Age11_twins_22022017 <- merge(eval(parse(text = bindedDataNames[1]))
[,-c(1:2)],
eval(parse(text = bindedDataNames[2]))
[,-c(1:3)],
by=c("ifam","ID"))
#Loop to merge all datasets. With print I saw it goes without a problem until
#the 6th dataset
for (cnt2 in 3:17) {
print(cnt2)
Age11_twins_22022017 <- merge(Age11_twins_22022017,
eval(parse(text = bindedDataNames[cnt2]))
[,-c(1:3)],
by=c("ifam","ID"))
}
I saw that there are packages for big data such as bigmemory or ff, but couldn't really figure out how to write the merge result (which is different from step to step) into this big matrix.
Is it even possible in R to merge several datasets into a really big one?
I would want to both be able to export this file to later use in SPSS and be able to do statistical analysis in R itself.
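For what it is worth, the loop can be written without eval(parse()) by collecting the data frames in a list and folding merge() over it. The sketch below uses three tiny hypothetical datasets (ds1 to ds3) in place of the real ones, and first checks whether the ("ifam", "ID") pairs are unique in every dataset, because duplicated keys multiply rows at each merge and may be one reason memory runs out.

```r
# Tiny hypothetical stand-ins for the real datasets
ds1 <- data.frame(ifam = c(1, 1, 2), ID = c(1, 2, 1), x1 = 1:3)
ds2 <- data.frame(ifam = c(1, 1, 2), ID = c(1, 2, 1), x2 = 4:6)
ds3 <- data.frame(ifam = c(1, 1, 2), ID = c(1, 2, 1), x3 = 7:9)
bindedDataNames <- c("ds1", "ds2", "ds3")

# Fetch the objects by name into a list (replaces eval(parse(text = ...)))
datasets <- mget(bindedDataNames)

# TRUE for any dataset whose key pairs are not unique
dup_keys <- vapply(
  datasets,
  function(d) anyDuplicated(d[c("ifam", "ID")]) > 0,
  logical(1)
)
print(dup_keys)

# Fold merge() over the list, one dataset at a time
merged <- Reduce(function(x, y) merge(x, y, by = c("ifam", "ID")), datasets)
dim(merged)
```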
I have some hierarchical data, e.g.,
> library(dplyr)
> df <- data_frame(id = 1:6, parent_id = c(NA, 1, 1, 2, 2, 5))
> df
Source: local data frame [6 x 2]

     id parent_id
  (int)     (dbl)
1     1        NA
2     2         1
3     3         1
4     4         2
5     5         2
6     6         5
I would like to plot the tree in a "top down" view through a circle packing plot:
http://bl.ocks.org/mbostock/4063530
The above link is for a d3 library. Is there an equivalent that allows me to make such a plot in ggplot2?
(I want this plot in a shiny app, which does support d3, but I haven't used d3 before and am unsure about the learning curve. If d3 is the obvious choice, I will try to get that working instead. Thanks.)
There were two steps: (1) aggregate the data, then (2) convert to json. After that, all the javascript has been written in that example page, so you can just plug in the resulting json data.
Since the aggregated data should have a similar structure to a treemap, we can use the treemap package to do the aggregation (could also use a loop with successive aggregation). Then, d3treeR (from github) is used to convert the treemap data to a nested list, and jsonlite to convert the list to json.
I'm using some example data GNI2010, found in the d3treeR package. You can see all of the source files on plunker.
library(treemap)
library(d3treeR) # devtools::install_github("timelyportfolio/d3treeR")
library(data.tree)
library(jsonlite)
## Get treemap data using package treemap
## Using example data GNI2010 from d3treeR package
data(GNI2010)
## aggregate by these: continent, iso3,
## size by population, and color by GNI
indexList <- c('continent', 'iso3')
treedat <- treemap(GNI2010, index=indexList, vSize='population', vColor='GNI',
type="value", fun.aggregate = "sum",
palette = 'RdYlBu')
treedat <- treedat$tm # pull out the data
## Use d3treeR to convert to nested list structure
## Call the root node 'flare' so we can just plug it into the example
res <- d3treeR:::convert_treemap(treedat, rootname="flare")
## Convert to JSON using jsonlite::toJSON
json <- toJSON(res, auto_unbox = TRUE)
## Save the json to a directory with the example index.html
writeLines(json, "d3circle/flare.json")
I also replaced the source line in the example index.html with:
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js"></script>
Then fire up the index.html and you should see the circle packing plot.
To create the shiny bindings should be doable using htmlwidgets and following some examples (the d3treeR source has some). Note that certain things aren't working, like the coloring. The json that gets stored here actually contains a lot of information about the nodes (all the data aggregated using the treemap) that you could leverage in the figure.
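For the small id/parent_id table from the question, the nested list can also be built directly with a short recursive function, without going through treemap. This is only a sketch: build_node is a hypothetical helper, and the name/children/size fields are an assumption about what the example page's flare.json layout expects, with every leaf given size 1.

```r
# Parent-child table from the question
df <- data.frame(id = 1:6, parent_id = c(NA, 1, 1, 2, 2, 5))

# Recursively build one node and all of its descendants
build_node <- function(node_id, edges) {
  kids <- edges$id[!is.na(edges$parent_id) & edges$parent_id == node_id]
  node <- list(name = as.character(node_id))
  if (length(kids) > 0) {
    node$children <- lapply(kids, build_node, edges = edges)
  } else {
    node$size <- 1  # assumed leaf size; replace with a real measure if available
  }
  node
}

# The root is the row without a parent
root_id <- df$id[is.na(df$parent_id)]
tree <- build_node(root_id, df)

# Convert to JSON only if jsonlite is installed
if (requireNamespace("jsonlite", quietly = TRUE)) {
  json <- jsonlite::toJSON(tree, auto_unbox = TRUE)
}
```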
New user to R (as in 2 days of use new) and coming from MATLAB, so syntax nuances are driving me a little crazy. If anyone can point me in a direction on this topic I would really appreciate it. I have this dataset (fl1.back) that has 32 variables (columns) and 513 measurements (rows), and I want to create a table of basic summary statistics for 9 of the 32 columns of data. There's a separate dataset (fl2.back) that I would also like to pull 1 column of data from for the final table.
Here's the code I used to do the above tasks for 1 of the columns of data (sodium measurements) from fl1.back and fl2.back:
fl1.back <- read.delim("web.flat",comment.char="#",colClasses="character")
fl1.back <- fl1.back[-1,]
fl2.back <- read.delim("web.flat2",comment.char="#",colClasses="character")
fl2.back <- fl2.back[-1,]
head(fl1.back)
head(fl2.back)
#for rep criteria for sodium
back.sod.rep <- fl2.back[fl2.back$P00930!="",]
back.sod.rep$P00930 <- as.numeric(back.sod.rep$P00930)
back.sod.rep$P00930
#for samples...sodium
back.sod <- fl1.back[fl1.back$P00930!="",]
back.sod$P00930 <- as.numeric(back.sod$P00930)
back.sod$P00930
head(back.sod)
back.sod.summ <- data.frame("Sodium")
back.sod.summ
colnames(back.sod.summ) <- "Compound"
back.sod.summ$WQ_crit <- "20 mg/L"
back.sod.summ$n <- nrow(back.sod)
back.sod.summ$n_det <- nrow(back.sod[back.sod$R00930!="<",])
back.sod.summ$min <- min(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$max <- max(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$mean <- mean(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$median <- median(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$percent_samp_det <- 100*(back.sod.summ$n_det/back.sod.summ$n)
back.sod.summ$percent_samp_above_crit <- 100*(length(back.sod[back.sod$P00930>20,"P00930"])/back.sod.summ$n)
back.sod.summ$percent_rep_above_crit <- 100*(sum(back.sod.rep$P00930>=20)/(nrow(back.sod.rep)))
back.sod$P00930
length(back.sod[back.sod$P00930>20,"P00930"])
back.sod.summ
final <- data.frame(back.sod.summ)
Instead of rewriting/copying and pasting the above code to create the data frame final, I would like to loop over the two datasets since I'm looking to repeat the same task, just on different columns of data. I really don't know where to start, and there doesn't seem to be much literature on for loops in R.
Any insight is appreciated!
Here is an example of what I think you want with the iris dataset:
library(plyr)
dlply(iris, .(Species), summary)
This can be extended if you need additional stats. Anyway, you probably should use (as I show above) the "split-apply-combine" approach as implemented in various functions and packages.
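Applied to the question's layout, the same idea in base R might look like the sketch below: wrap the per-compound steps in a function, then apply it once per compound and bind the rows. The toy data, the second compound (code 00940 labelled "Chloride") and its 250 mg/L criterion are hypothetical. The function assumes the value column is named P<code> and the remark column R<code>, as in the sodium example.

```r
# Summarise one compound; code is the shared suffix of the P/R column names
summarise_compound <- function(dat, code, label, crit) {
  value_col  <- paste0("P", code)
  remark_col <- paste0("R", code)
  dat <- dat[dat[[value_col]] != "", ]           # keep rows with a measurement
  values   <- as.numeric(dat[[value_col]])
  detected <- values[dat[[remark_col]] != "<"]   # drop non-detects
  data.frame(
    Compound = label,
    WQ_crit  = paste(crit, "mg/L"),
    n        = length(values),
    n_det    = length(detected),
    min      = min(detected),
    max      = max(detected),
    mean     = mean(detected),
    median   = median(detected),
    percent_samp_det        = 100 * length(detected) / length(values),
    percent_samp_above_crit = 100 * mean(values > crit),
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for fl1.back (all columns read as character)
toy <- data.frame(
  P00930 = c("12", "25", "", "30", "5"),
  R00930 = c("", "", "", "", "<"),
  P00940 = c("100", "300", "260", "", "10"),
  R00940 = c("", "", "", "", "<"),
  stringsAsFactors = FALSE
)

# One row per compound to summarise
compounds <- data.frame(
  code  = c("00930", "00940"),
  label = c("Sodium", "Chloride"),
  crit  = c(20, 250),
  stringsAsFactors = FALSE
)

# Apply the function to each compound and stack the one-row results
final <- do.call(rbind, lapply(seq_len(nrow(compounds)), function(i) {
  summarise_compound(toy, compounds$code[i], compounds$label[i], compounds$crit[i])
}))
final
```

The replicate column from fl2.back could be handled the same way, by passing that data frame as a second argument and computing the extra percentage inside the function.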
I have data in Excel sheets and I need a way to clean it. I would like to remove inconsistent values; for example, the Branch name is specified in several ways (Computer Science and Engineering, C.S.E, C.S, Computer Science). So how can I bring all of them into a single notation?
The car package has a recode function. See its help page for worked examples.
In fact an argument could be made that this should be a closed question:
Why is recode in R not changing the original values?
How to recode a variable to numeric in R?
Recode/relevel data.frame factors with different levels
And a few more questions easily identifiable with a search: [r] recode
EDIT:
I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.
my_recode <- function(fac, levslist) {
  # levslist has the form:
  #   list(animal = c("cow", "pig"),
  #        bird   = c("eagle", "pigeon"))
  nfac <- factor(fac)
  inlevs <- levels(nfac)
  # levels not mentioned in levslist are grouped as "all_others"
  othrlevs <- inlevs[!inlevs %in% unlist(levslist)]
  # wrap in list() so that several leftover levels stay in one group
  levels(nfac) <- c(levslist, list(all_others = othrlevs))
  nfac
}
df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"),
stringsAsFactors = FALSE)
df$type <- my_recode(df$name, list(
animal = c("cow", "pig"),
bird = c("eagle", "pigeon") ) )
df
#-----------
    name       type
1    cow     animal
2    pig     animal
3  eagle       bird
4 pigeon       bird
5  zebra all_others
You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)
If not, I would suggest Open Refine. It is a great tool exactly for this job. It is not hosted, you can safely download it and run against your dataset (xls/xlsx work fine), you then create a text facet and group away.
It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.
The videos at the official web site are useful.
There are no one-size-fits-all solutions for these types of problems. From what I understand, you have Branch Names that are inconsistently labelled.
You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.
The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using the built-in letters vector so you can see what I mean:
your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters, 2000, replace=TRUE)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))
Now that we have a sorted list of unique values, we can create a listing of recodes:
Let's say we wanted to rename a through g as just A:
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
You would then repeat the process above, eliminating or grouping the unique names as appropriate.
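When the variants are known, a named lookup vector does all the replacements in one step. Here is a minimal sketch using the branch names from the question. The choice of "C.S.E" as the canonical label and the extra "Mechanical" value are assumptions for illustration.

```r
# Example values, including one branch that needs no recoding
branches <- c("Computer Science and Engineering", "C.S.E", "C.S",
              "Computer Science", "Mechanical")

# Lookup table: names are the variants, values are the canonical label
lookup <- c(
  "Computer Science and Engineering" = "C.S.E",
  "C.S.E"                            = "C.S.E",
  "C.S"                              = "C.S.E",
  "Computer Science"                 = "C.S.E"
)

# Replace known variants and leave everything else untouched
cleaned <- ifelse(branches %in% names(lookup), lookup[branches], branches)
cleaned <- unname(cleaned)
table(cleaned)
```

Running sort(unique(cleaned)) afterwards shows which spellings are still left to map.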