Splitting an object in R - r

I would like to split an object in R according to the suffixes of the barcodes it contains. These end in '-n' where n is a number from 1 to 6. e.g. AAACCGTGCCCTCA-1, GAACCGTGCCCTCA-2, CATGCGTGCCCTCA-5, etc. I would like all the corresponding information about each barcode to be split accordingly as well. Here is some example code of an object, cds.
class(cds)
[1] "CellDataSet"
attr(,"package")
[1] "monocle"
split(cds, cds$barcode)
#not by individual barcodes, but by groups of those ending '-1', '-2',...,'-6'. So 6 new objects in total
Many thanks!
Abigail

split() does not work here because the cells are stored as columns, so you need to subset by column. I am not sure whether a split method is defined for this class. You can try the following:
First to get something like your example:
library(monocle)
library(HSMMSingleCell)
library(Biostrings)
cds = load_HSMM()
class(cds)
[1] "CellDataSet"
attr(,"package")
[1] "monocle"
dim(cds)
Features Samples
47192 271
And to create a barcode for every sample:
bar = paste(names(oligonucleotideFrequency(DNAString("NNNNN"),5))[1:ncol(cds)],
sample(1:6,ncol(cds),replace=TRUE),sep="-")
head(bar)
[1] "AAAAA-3" "AAAAC-6" "AAAAG-5" "AAAAT-1" "AAACA-5" "AAACC-5"
Now we get the group, which is the suffix 1-6 :
cds$barcodes= bar
grp = sub("[A-Z]*[-]","",cds$barcodes)
To get one subset, for example those with "-1", you can just do:
group1 = cds[,grp==1]
dim(group1)
Features Samples
47192 46
head(group1$barcodes)
[1] "AAAAT-1" "AACGA-1" "AAGCG-1" "AAGGG-1" "AAGTA-1" "AATAG-1"
To get your 6 groups, you can do the below, but check whether your machine has the memory to accommodate this!
subset_obj = lapply(unique(grp),function(i){
cds[,grp==i]
})
names(subset_obj) = unique(grp)

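We can use sub to strip everything up to the final hyphen, which leaves the numeric suffix, and split the column indices of 'cds' on that (removing the -\\d+ suffix instead would group by sequence rather than by suffix):
lapply(split(seq_len(ncol(cds)), sub(".*-", "", cds$barcode)), function(i) cds[, i])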
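As a minimal sketch of the grouping step, the toy example below uses a plain matrix in place of the CellDataSet (an assumption, so that it runs without monocle); the barcodes and matrix values are made up. It shows that stripping the suffix leaves the sequence, whereas stripping everything before the last hyphen leaves the group label, which is what the split needs.

```r
# Hypothetical barcodes; the matrix stands in for the CellDataSet,
# with one column per barcode.
barcodes <- c("AAACCGTGCCCTCA-1", "GAACCGTGCCCTCA-2",
              "CATGCGTGCCCTCA-5", "TTTCCGTGCCCTCA-1")
m <- matrix(seq_len(12), nrow = 3, dimnames = list(NULL, barcodes))

# Removing the suffix keeps only the sequence part.
sequence_part <- sub("-\\d+$", "", barcodes)

# Removing everything up to the last hyphen keeps the group label.
suffix <- sub(".*-", "", barcodes)

# Split the column indices by suffix, then subset the columns per group.
groups <- lapply(split(seq_along(barcodes), suffix),
                 function(idx) m[, idx, drop = FALSE])

names(groups)            # group labels found in the data
colnames(groups[["1"]])  # barcodes that end in -1
```

Splitting the indices rather than the object itself only relies on column subsetting with `[`, which the first answer already uses on the CellDataSet.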

Related

R: How to get column names for columns that contain a certain word AND their associated index number?

I want to create a list of column names that contain the word "arrest" AND their associated index number. I do not want all the columns, so I DO NOT want to subset the arrest columns into a new data frame. I merely want to see the list of names and their index numbers so I can delete the ones I don't want from the original data frame.
I tried getting the column names and their associated index numbers by using the below codes, but they only gave one or the other.
This gives me their names only
colnames(x2009_2014)[grepl("arrest",colnames(x2009_2014))]
[1] "poss_cannabis_tot_arrests" "poss_drug_total_tot_arrests"
[3] "poss_heroin_coke_tot_arrests" "poss_other_drug_tot_arrests"
[5] "poss_synth_narc_tot_arrests" "sale_cannabis_tot_arrests"
[7] "sale_drug_total_tot_arrests" "sale_heroin_coke_tot_arrests"
[9] "sale_other_drug_tot_arrests" "sale_synth_narc_tot_arrests"
[11] "total_drug_tot_arrests"
This gives me their index numbers only
grep("arrest", colnames(x2009_2014))
[1] 93 168 243 318 393 468 543 618 693 768 843
But I want their name AND index number so that it looks something like this
[93] "poss_cannabis_tot_arrests"
[168] "poss_drug_total_tot_arrests"
[243] "poss_heroin_coke_tot_arrests"
[318] "poss_other_drug_tot_arrests"
[393] "poss_synth_narc_tot_arrests"
[468] "sale_cannabis_tot_arrests"
[543] "sale_drug_total_tot_arrests"
[618] "sale_heroin_coke_tot_arrests"
[693] "sale_other_drug_tot_arrests"
[768] "sale_synth_narc_tot_arrests"
[843] "total_drug_tot_arrests"
Lastly, using advice here, I used the below code, but it did not work.
K=sapply(x2009_2014,function(x)any(grepl("arrest",x)))
which(K)
named integer(0)
The person who provided the advice in the above link used
K=sapply(df,function(x)any(grepl("\\D+",x)))
names(df)[K]
Zo.A Zo.B
which(K)
Zo.A Zo.B
2 4
I'd prefer the list I showed in the third block of code, but the code this person used provides a structure I can work with. It just did not work for me when I tried using it.
Hacky as a one-liner because I really dislike using <- inside a function call, but this should work:
setNames(
nm = matches <- grep("arrest", colnames(x2009_2014)),
colnames(x2009_2014)[matches]
)
Reproducible example:
setNames(nm = x <- grep("b|c", letters), letters[x])
# 2 3
# "b" "c"
Or write your own function that does it. Here I put it in a data frame, which seems nicer than a named vector:
grep_ind_value = function(pattern, x, ...) {
index = grep(pattern, x, ...)
value = x[index]
data.frame(index, value)
}
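A short usage sketch of such a helper on a made-up data frame follows; the data frame, its column names, and the helper name `grep_index_value` are hypothetical. Note that the `sapply(x2009_2014, ...)` attempt in the question searches the values inside each column rather than the column names, which is why it found nothing.

```r
# Hypothetical helper: return the index and name of every element of `x`
# that matches `pattern`.
grep_index_value <- function(pattern, x, ...) {
  index <- grep(pattern, x, ...)
  data.frame(index = index, value = x[index], stringsAsFactors = FALSE)
}

# Made-up data frame standing in for x2009_2014.
df <- data.frame(county = c("A", "B"),
                 poss_cannabis_tot_arrests = c(1, 2),
                 population = c(100, 200),
                 total_drug_tot_arrests = c(3, 4))

# Search the column names, not the column contents.
hits <- grep_index_value("arrest", colnames(df))
hits

# After inspecting the table, drop an unwanted column by its index.
df_small <- df[, -hits$index[2], drop = FALSE]
```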

Rename row.name in data frame using matches or partial matches from a list

I have a data frame in R with 341 rows. I want to rename the row names using a list with 349 names. All 341 names will be in this list for sure. But not all of them will be perfect hits.
The data looks like this
rownames(df_RPM1)
[1] "LQNS02059392.1_11686_5p"
[2] "LQNS02277998.1_30984_3p"
[3] "LQNS02277998.1_30984_5p"
[4] "LQNS02277998.1_30988_3p"
[5] "LQNS02277998.1_30988_5p"
[6] "LQNS02277997.1_30943_3p"
[7] "miR-9|LQNS02278070.1_31740_3p"
[8] "miR-9|LQNS02278094.1_36129_3p"
head(inlist)
[1] "dpu-miR-2-03_LQNS02059392.1_11686_5p" "dpu-miR-10-P2_LQNS02277998.1_30984_3p"
[3] "dpu-miR-10-P2_LQNS02277998.1_30984_5p" "dpu-miR-10-P3_LQNS02277998.1_30988_3p"
[5] "dpu-miR-10-P3_LQNS02277998.1_30988_5p" "miR-9|LQNS02278070.1_31740_3p"
[6] "miR-9|LQNS02278094.1_36129_3p"
The order won't necessarily be the same in the two.
Can anyone suggest how to do this in R?
Thanks a lot
It depends a lot on what a "non-perfect hit" looks like. Assuming the row name is a substring of the real name, str_which() does the job quite well:
library(tidyverse)
real_names <- c("dpu-miR-2-03_LQNS02059392.1_11686_5p",
"dpu-miR-10-P2_LQNS02277998.1_30984_3p",
"dpu-miR-10-P2_LQNS02277998.1_30984_5p",
"dpu-miR-10-P3_LQNS02277998.1_30988_3p",
"dpu-miR-10-P3_LQNS02277998.1_30988_5p",
"miR-9|LQNS02278070.1_31740_3p",
"miR-9|LQNS02278094.1_36129_3p")
str_which(real_names, "LQNS02059392.1_11686_5p")
#> [1] 1
So we can vectorize (I removed the element 6 which is not found in the example list):
pos <- map_int(rownames(df_RPM1), ~ str_which(real_names, fixed(.)))
pos
#> [1] 1 2 3 4 5 6 7
And all that's left is to change the row names:
rownames(df_RPM1) <- real_names[pos]
Of course, if a non-perfect hit means something more complicated, you may need to create a regex from the row names or something like that.
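The same substring lookup can be sketched in base R with a guard for row names that match no entry or more than one entry. The helper `match_one`, the shortened name list, and the `"not_in_list"` entry are illustrative assumptions. `fixed = TRUE` matters here because names such as `miR-9|LQNS...` contain `|` and `.`, which a regular expression would treat as special characters.

```r
# Shortened, partly made-up inputs for illustration.
real_names <- c("dpu-miR-2-03_LQNS02059392.1_11686_5p",
                "dpu-miR-10-P2_LQNS02277998.1_30984_3p",
                "miR-9|LQNS02278070.1_31740_3p")
short_names <- c("LQNS02277998.1_30984_3p",
                 "miR-9|LQNS02278070.1_31740_3p",
                 "LQNS02059392.1_11686_5p",
                 "not_in_list")

# Hypothetical helper: position of the single candidate containing `s`,
# or NA when there is no match or the match is ambiguous.
match_one <- function(s, candidates) {
  hits <- grep(s, candidates, fixed = TRUE)
  if (length(hits) == 1L) hits else NA_integer_
}

pos <- vapply(short_names, match_one, integer(1),
              candidates = real_names, USE.NAMES = FALSE)

# Keep the old name wherever no unique match was found.
new_names <- ifelse(is.na(pos), short_names, real_names[pos])
new_names
```

Returning NA instead of failing makes it easy to list the row names that still need manual attention with `short_names[is.na(pos)]`.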

How to get the center and scale after using the scale function in R

It seems a silly question, but I have searched online and still have not found a sufficient answer.
My question is: suppose we have a matrix M and apply the scale() function to it. How can we extract the center and scale of each column with a line of code? I know we can see the centers and scales in the printed output, but my matrix has lots of columns, so it is cumbersome to do manually.
Any ideas? Many thanks!
You are looking for the attributes function:
set.seed(1)
mat = matrix(rnorm(1000),,10) # Suppose you have 10 columns
s = scale(mat) # scale your data
attributes(s)#This gives you the means and the standard deviations:
$`dim`
[1] 100 10
$`scaled:center`
[1] 0.1088873669 -0.0378080766 0.0296735350 0.0516018586 -0.0391342406 -0.0445193567 -0.1995797418
[8] 0.0002549694 0.0100772648 0.0040650015
$`scaled:scale`
[1] 0.8981994 0.9578791 1.0342655 0.9916751 1.1696122 0.9661804 1.0808358 1.0973012 1.0883612 1.0548091
These values can also be obtained as:
colMeans(mat)
[1] 0.1088873669 -0.0378080766 0.0296735350 0.0516018586 -0.0391342406 -0.0445193567 -0.1995797418
[8] 0.0002549694 0.0100772648 0.0040650015
sqrt(diag(var(mat)))
[1] 0.8981994 0.9578791 1.0342655 0.9916751 1.1696122 0.9661804 1.0808358 1.0973012 1.0883612 1.0548091
attributes(s) returns a list that you can subset the way you want. Alternatively, you can extract each attribute directly:
attr(s,"scaled:center")
[1] 0.1088873669 -0.0378080766 0.0296735350 0.0516018586 -0.0391342406 -0.0445193567 -0.1995797418
[8] 0.0002549694 0.0100772648 0.0040650015
attr(s,"scaled:scale")
[1] 0.8981994 0.9578791 1.0342655 0.9916751 1.1696122 0.9661804 1.0808358 1.0973012 1.0883612 1.0548091
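Once extracted, the two attributes can be reused, for example to undo the scaling or to apply the same transformation to new rows. The following is a minimal sketch; `new_data` is a made-up matrix used only for illustration.

```r
set.seed(1)
mat <- matrix(rnorm(1000), ncol = 10)
s <- scale(mat)

ctr <- attr(s, "scaled:center")  # column means
scl <- attr(s, "scaled:scale")   # column standard deviations

# Undo the scaling: multiply each column by its scale, then add its center.
restored <- sweep(sweep(s, 2, scl, "*"), 2, ctr, "+")
all.equal(restored, mat, check.attributes = FALSE)

# Apply the same centering and scaling to new observations.
new_data <- matrix(rnorm(50), ncol = 10)
new_scaled <- scale(new_data, center = ctr, scale = scl)
```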

R: How to remove quotation marks in a vector of strings, but maintain vector format as to call each individual value?

I want to create a vector of names that act as variable names so I can then use them later on in a loop.
years=1950:2012
varname=character(length(years))
for(i in 1:length(years))
{
varname[i]=paste("mydata",years[i],sep="")
}
this gives:
> [1] "mydata1950" "mydata1951" "mydata1952" "mydata1953" "mydata1954" "mydata1955" "mydata1956" "mydata1957" "mydata1958"
[10] "mydata1959" "mydata1960" "mydata1961" "mydata1962" "mydata1963" "mydata1964" "mydata1965" "mydata1966" "mydata1967"
[19] "mydata1968" "mydata1969" "mydata1970" "mydata1971" "mydata1972" "mydata1973" "mydata1974" "mydata1975" "mydata1976"
[28] "mydata1977" "mydata1978" "mydata1979" "mydata1980" "mydata1981" "mydata1982" "mydata1983" "mydata1984" "mydata1985"
[37] "mydata1986" "mydata1987" "mydata1988" "mydata1989" "mydata1990" "mydata1991" "mydata1992" "mydata1993" "mydata1994"
[46] "mydata1995" "mydata1996" "mydata1997" "mydata1998" "mydata1999" "mydata2000" "mydata2001" "mydata2002" "mydata2003"
[55] "mydata2004" "mydata2005" "mydata2006" "mydata2007" "mydata2008" "mydata2009" "mydata2010" "mydata2011" "mydata2012"
All I want to do is remove the quotes and be able to call each value individually.
I want:
>[1] mydata1950 mydata1951 mydata1952 mydata1953, #etc...
stored as a variable such that
varname[1]
> mydata1950
varname[2]
> mydata1951
and so on.
I have played around with
cat(varname[i],"\n")
but this just prints values as one line and I can't call each individual string. And
gsub("'",'',varname)
but this doesn't seem to do anything.
Suggestions? Is this possible in R? Thank you.
There are no quotes in that character vector's values. Use:
cat(varname)
.... if you want to see the unquoted values. The R print mechanism is set to use quotes as a signal to your brain that distinct values are present. You can also use:
print(varname, quote=FALSE)
If there are that many named objects in your workspace, then you desperately need to learn to use lists. There are mechanisms for "promoting" character values to names, but this would be seen as a failure on your part to learn to use the language effectively:
var <- 2
> eval(as.name('var'))
[1] 2
> eval(parse(text="var"))
[1] 2
> get('var')
[1] 2
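The list-based alternative mentioned above can be sketched as follows. The contents of each data frame are made up; only the naming scheme comes from the question. Each yearly data set becomes one named element of a single list, so no separate `mydata1950`, `mydata1951`, ... objects are needed.

```r
years <- 1950:2012
varname <- paste0("mydata", years)  # vectorised, no loop needed

# One (made-up) data frame per year, stored in a single named list.
datasets <- lapply(years, function(y) data.frame(year = y, value = y - 1950))
names(datasets) <- varname

# Access one element by name, or loop over all of them.
datasets[["mydata1951"]]
row_counts <- vapply(datasets, nrow, integer(1))
```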

store summary output in a list of tables or matrix

How can I read the following vector "c" of strings into a list of tables? Which way is the shortest: read.table or strsplit? For example, I can't see how to read the table in c[4:6] in one command.
require(car)
m<-matrix(rnorm(16),4,4,byrow=T)
a<-Anova(lm(m~1),type=3,idata=data.frame(treatment=factor(1:4)),idesign=~treatment)
c<-capture.output(summary(a,multivariate=F))
c
This returns lines 4:6
c[4:6]
Now if you wanted to parse this I would do it in two steps. First on the column values from rows 5:6 and then add back the names.
> vals <- read.table(text=c[5:6])
> txt <- " \t SS\t num Df\t Error SS\t den Df\t F\t Pr(>F)"
> names(vals) <- names(read.delim(text=txt))
> vals
X SS num.Df Error.SS den.Df F Pr..F.
1 (Intercept) 0.57613392 1 0.4219563 3 4.09616 0.13614
2 treatment 1.85936442 3 8.2899759 9 0.67287 0.58996
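The two steps can also be collapsed into one call by passing the column names directly to `read.table`. The sketch below is self-contained: the character vector `out_lines` is a stand-in for the captured output, with the two data rows copied from the result shown above, and the column names are chosen here for illustration.

```r
# Stand-in for the captured summary lines (header plus two data rows).
out_lines <- c("            SS num Df Error SS den Df       F  Pr(>F)",
               "(Intercept) 0.57613392 1 0.4219563 3 4.09616 0.13614",
               "treatment   1.85936442 3 8.2899759 9 0.67287 0.58996")

# Skip the header line and supply syntactically valid column names.
vals <- read.table(text = out_lines[2:3],
                   col.names = c("term", "SS", "num_Df", "Error_SS",
                                 "den_Df", "F", "p_value"))
vals
vals$F[vals$term == "treatment"]
```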
EDIT --
you could look at the source code of the summary function and calculate the quantities required by yourself
getAnywhere(summary.Anova.mlm)
The original idea seems not to work.
c2 <- summary(a)
# find out what 'properties' the summary object has
# turns out, it is just the Anova object
class(c2) <- "list"
names(c2)
This returns
[1] "SSP" "SSPE" "P" "df" "error.df"
[6] "terms" "repeated" "type" "test" "idata"
[11] "idesign" "icontrasts" "imatrix" "singular"
and we can access them
c2$SSP
c2$SSPE
It does not seem a good idea to use the name of R's built-in c function as a variable name.
