I'm using R, and I'm a beginner. I have two large lists (30K elements each). One is called descriptions, where each element is (maybe) a tokenized string. The other is called probes, where each element is a number. I need to make a dictionary that maps probes to something in descriptions, if that something is there. Here's how I'm going about this:
probe2gene <- list()
for (i in 1:length(probes)){
  strings <- strsplit(descriptions[i], '//')
  if (length(strings[[1]]) > 1){
    probe2gene[probes[i]] = strings[[1]][2]
  }
}
Which works fine, but seems slow, much slower than the roughly equivalent python:
probe2gene = {}
for p, d in zip(probes, descriptions):
    try:
        probe2gene[p] = d.split('//')[1]
    except IndexError:
        pass
My question: is there an "R-thonic" way of doing what I'm trying to do? The R manual entry on for loops suggests that such loops are rare. Is there a better solution?
Edit: a typical good "description" looks like this:
"NM_009826 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// AB070619 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421 /// ENSMUST00000027040 // Rb1cc1 // RB1-inducible coiled-coil 1 // 1 A2 // 12421"
A bad "description" looks like this:
"-----"
though it can quite easily be some other not-very-helpful string. Each probe is simply a number. The probe and description vectors are the same length, and completely correspond to each other, i.e. probe[i] maps to description[i].
It's usually better in R if you use the various apply-like functions, rather than a loop. I think this solves your problem; the only drawback is that you have to use string keys.
> descriptions <- c("foo//bar", "")
> probes <- c(10, 20)
> probe2gene <- lapply(strsplit(descriptions, "//"), function (x) x[2])
> names(probe2gene) <- probes
> probe2gene <- probe2gene[!is.na(probe2gene)]
> probe2gene[["10"]]
[1] "bar"
Unfortunately, R doesn't have a good dictionary/map type. The closest I've found is using lists as a map from string-to-value. That seems to be idiomatic, but it's ugly.
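As a minimal sketch of the same idea on input closer to the question (the probe numbers and descriptions below are made up for illustration), a named character vector can also serve as the string-keyed lookup table. The `trimws()` call is only there on the assumption that the real descriptions have spaces around `//`:

```r
# Made-up inputs for illustration; the real data would have ~30K elements.
descriptions <- c("NM_009826 // Rb1cc1 // RB1-inducible coiled-coil 1",
                  "-----",
                  "foo//bar")
probes <- c(101, 102, 103)

# Split every description at once; fixed = TRUE treats "//" literally.
parts <- strsplit(descriptions, "//", fixed = TRUE)

# Take the second piece of each split; x[2] is NA when there is no second piece.
genes <- trimws(vapply(parts, function(x) x[2], character(1)))

# Drop probes without a gene and key the rest by probe number (as strings).
keep <- !is.na(genes)
probe2gene <- setNames(genes[keep], probes[keep])

probe2gene[["101"]]
```

Lookups still need string keys such as "101", as with the list version.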
If I understand correctly, you are looking to save each probe-description combination where there is more than one (split) value in the description?
Probe and Description are the same length?
This is kind of messy, but here is a quick first pass at it:
a <- list("a","b","c")
b <- list(c("a","b"),c("DEF","ABC"),c("Z"))
names(b) <- a
matches <- which(lapply(b, length)>1) #several ways to do this
b <- lapply(b[matches], function(x) x[2]) #keeps the second element only
That's my first attempt. If you have a sample dataset that would be very useful.
Best regards,
Jay
Another way.
probe<-c(4,3,1)
gene<-c('red//hair','strange','blue//blood')
probe2gene<-character()
probe2gene[probe]<-sapply(strsplit(gene,'//'),'[',2)
probe2gene
[1] "blood" NA NA "hair"
In the sapply, we take advantage of the fact that in R the subsetting operator is also a function named '[', to which we can pass the index as an argument. Also, an out-of-range index does not cause an error but gives an NA value. On the left-hand side of the same line, we use the fact that we can pass a vector of indices in any order and with gaps.
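To make those two facts concrete, here is a small sketch (the vector `x` and the variable names are just for illustration) showing `[` called as an ordinary function and the NA that comes back for an out-of-range index:

```r
x <- c("red", "hair")

# `[` can be called like any other function: this is the same as x[2].
second_of_x <- `[`(x, 2)

# An out-of-range index gives NA instead of an error.
missing_of_x <- `[`(x, 5)

# Passing `[` to sapply with the extra argument 2 picks the second piece
# of every split, with NA where a string had no "//".
pieces <- strsplit(c("red//hair", "strange"), "//", fixed = TRUE)
second <- sapply(pieces, `[`, 2)
second
```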
Here's another approach that should be fast. Note that this doesn't
remove the empty descriptions. It could be adapted to do that or you
could clean those in a post processing step using lapply. Is it the
case that you'll never have a valid description of length one?
make_desc <- function(n)
{
word <- function(x) paste(sample(letters, 5, replace=TRUE), collapse = "")
if (runif(1) < 0.70)
paste(sapply(seq_len(n), word), collapse = "//")
else
"----"
}
description <- sapply(seq_len(10), make_desc)
probes <- seq_len(length(description))
desc_parts <- strsplit(description, "//", fixed=TRUE, useBytes=TRUE)
lens <- sapply(desc_parts, length)
probes_expand <- rep(probes, lens)
ans <- split(unlist(desc_parts), probes_expand)
> description
[1] "fmbec"
[2] "----"
[3] "----"
[4] "frrii//yjxsa//wvkce//xbpkc"
[5] "kazzp//ifrlz//ztnkh//dtwow//aqvcm"
[6] "stupm//ncqhx//zaakn//kjymf//swvsr//zsexu"
[7] "wajit//sajgr//cttzf//uagwy//qtuyh//iyiue//xelrq"
[8] "nirex//awvnw//bvexw//mmzdp//lvetr//xvahy//qhgym//ggdax"
[9] "----"
[10] "ubabx//tvqrd//vcxsp//rjshu//gbmvj//fbkea//smrgm//qfmpy//tpudu//qpjbu"
> ans[[3]]
[1] "----"
> ans[[4]]
[1] "frrii" "yjxsa" "wvkce" "xbpkc"
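The post-processing step mentioned above could look something like the following sketch. It assumes that a description with only one part is never valid (which is exactly the open question above), and it uses a shortened, hand-written `description` vector rather than the random one:

```r
# Shortened stand-in for the generated data above.
description <- c("fmbec", "----", "frrii//yjxsa//wvkce")
probes <- seq_along(description)

desc_parts <- strsplit(description, "//", fixed = TRUE)
lens <- lengths(desc_parts)
ans <- split(unlist(desc_parts), rep(probes, lens))

# Keep only probes whose description split into more than one part,
# then take the second part of each.
has_gene <- lengths(ans) > 1
probe2gene <- lapply(ans[has_gene], `[`, 2)
probe2gene
```

Only probe "3" survives here, because the first two descriptions have a single part.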
Related
I feel like I've forgotten something very obvious here...
Let's say we have two lists, a and b, with differing lengths:
a <- list(me = "you1", they = "our1", our = "till1", grow = "NOPE1")
b <- list(me = "my2", their = "his2", our = "aft2", new = "noise2",
they = "now2", b_names = "thurs2")
We want to replace the items in a with corresponding items from b, if an item in b has the same name as an item in a.
Manually, essentially this would equate to replacing: me, our, they in list a from those items in list b.
For the life of me, the only approach I'm coming up with is using Reduce rather than match or %chin% etc. to find the intersection of names, and then always using the last list object as the look-up table. I suppose you really don't need Reduce, since intersect would work fine on its own, but regardless...
Isn't there a simpler, more straightforward way that I am simply forgetting?
Here's my code. It works, but that's not the point.
reduce.names <- function(...){
vars <- list(...)
if(length(vars) > 2){
return("only 2 lists allowed...")
}else {
Reduce(intersect, Map(names,vars))
}
}
> matched_names <- reduce.names(a,b)
> matched_names
[1] "me" "they" "our"
a[matched_names] <- b[matched_names]
> a
$me
[1] "my2"
$they
[1] "now2"
$our
[1] "aft2"
$grow
[1] "NOPE1"
Here's another approach that works, but it just seems redundant and sketchy:
> merge(a,b) %>% .[names(a)]
$me
[1] "my2"
$they
[1] "now2"
$our
[1] "aft2"
$grow
[1] "NOPE1"
Any advice/alternate approach/reminder of some base function I have completely forgotten would be greatly appreciated. Thanks.
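For reference, the plain `intersect` variant mentioned above would be roughly the following. This is only a sketch of that idea with the same two lists, not necessarily the simplest possible answer:

```r
a <- list(me = "you1", they = "our1", our = "till1", grow = "NOPE1")
b <- list(me = "my2", their = "his2", our = "aft2", new = "noise2",
          they = "now2", b_names = "thurs2")

# Names present in both lists, in the order they appear in a.
common <- intersect(names(a), names(b))

# Overwrite only those elements of a with the matching elements of b.
a[common] <- b[common]
a
```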
I am using the following code in a loop. I am reproducing only the part where I am facing the problem; the entire code is extremely long, and I have removed the parts between these lines that run fine. This is just to explain the problem:
for (j in 1:2)
{
assign(paste("numeric_data",j,sep="_"),unique_id)
for (i in 1:2)
{
assign(paste("numeric_data",j,sep="_"),
merge(eval(as.symbol(paste("numeric_data",j,sep="_"))),
eval(as.symbol(paste("sd_1",i,sep="_"))),all.x = TRUE))
}
}
The problem that I am facing is that instead of assign in the second step, I want to use (eval+paste)
for (j in 1:2)
{
assign(paste("numeric_data",j,sep="_"),unique_id)
for (i in 1:2)
{
eval(as.symbol((paste("numeric_data",j,sep="_"))))<-
merge(eval(as.symbol(paste("numeric_data",j,sep="_"))),
eval(as.symbol(paste("sd_1",i,sep="_"))),all.x = TRUE)
}
}
However, R does not accept eval when assigning to new variables. I looked at the forum, and everywhere assign is suggested to solve the problem. However, if I use assign, the loop overwrites my previously generated "numeric_data" instead of adding to it, so I get output for only one value of i instead of both.
Here is a very basic intro to one of the most fundamental data structures in R. I highly recommend reading more about them in standard documentation sources.
#A list is a (possibly named) set of objects
numeric_data <- list(A1 = 1, A2 = 2)
#I can refer to elements by name or by position, e.g. numeric_data[[1]]
> numeric_data[["A1"]]
[1] 1
#I can add elements to a list with a particular name
> numeric_data <- list()
> numeric_data[["A1"]] <- 1
> numeric_data[["A2"]] <- 2
> numeric_data
$A1
[1] 1
$A2
[1] 2
#I can refer to named elements by building the name with paste()
> numeric_data[[paste0("A",1)]]
[1] 1
#I can change all the names at once...
> numeric_data <- setNames(numeric_data,paste0("B",1:2))
> numeric_data
$B1
[1] 1
$B2
[1] 2
#...in multiple ways
> names(numeric_data) <- paste0("C",1:2)
> numeric_data
$C1
[1] 1
$C2
[1] 2
Basically, the lesson is that if you have objects with names with numeric suffixes: object_1, object_2, etc. they should almost always be elements in a single list with names that you can easily construct and refer to.
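Applied to the loop in the question, that might look like the sketch below. The contents of `unique_id` and the two `sd_1_*` data frames are invented here (the question does not show them), and the `sd_1` list is a hypothetical replacement for the separate `sd_1_1`, `sd_1_2` variables:

```r
# Invented stand-ins for the data in the question.
unique_id <- data.frame(id = 1:3)
sd_1 <- list(
  sd_1_1 = data.frame(id = 1:2, x = c(10, 20)),
  sd_1_2 = data.frame(id = 2:3, y = c("a", "b"))
)

numeric_data <- list()
for (j in 1:2) {
  # Start from unique_id and keep merging into the same object,
  # so every pass of the inner loop adds columns to the previous result.
  result <- unique_id
  for (i in seq_along(sd_1)) {
    result <- merge(result, sd_1[[i]], all.x = TRUE)
  }
  numeric_data[[paste0("numeric_data_", j)]] <- result
}

numeric_data[["numeric_data_1"]]
```

No `assign()` or `eval()` is needed, because the list element is addressed by a name built with `paste0()`.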
Tracemem is doing what I need it to, but it is also producing distracting visual clutter. Here is a simple example.
a<-1
b<-2
dummyfunction<-function(x,y){return(sum(x,y))}
dummyfunction(a,b)
[1] 3
Now, I want to do something more complex, first tracemem to see if the inputs are duplicated...
dummyfunction2<-function(x,y){if (tracemem(x)==tracemem(y)){return("Input vectors are identical")}
if(sum(x %in% y)>=length(x) & sum(y %in% x)>=length(y)){print("Something something.")}
return(sum(x,y))}
This does what I want if the inputs are duplicated...
dummyfunction2(a,a)
[1] "Input vectors are identical"
When they're not duplicated, though the function still works, it spews a bunch of confusing information.
dummyfunction2(a,b)
tracemem[0x0000000009824470 -> 0x000000000a7ced80]: match %in% dummyfunction2
tracemem[0x0000000009824500 -> 0x000000000a7cedb0]: match %in% dummyfunction2
tracemem[0x0000000009824500 -> 0x000000000a7cef90]: match %in% dummyfunction2
tracemem[0x0000000009824470 -> 0x000000000a7cc1a8]: match %in% dummyfunction2
[1] 3
I'm hoping to convince non-R users to try using a function with this issue, and output like this will certainly scare them off.
What is the most elegant way to remove this visual clutter without suppressing potentially informative warnings, etc. that may crop up in other portions of the function?
From http://stat.ethz.ch/R-manual/R-patched/library/base/html/tracemem.html :
"This function marks an object so that a message is printed whenever the internal code copies the object."
You could stick untracemem into the function to get around it:
dummyfunction3<-function(x,y){
if (tracemem(x)==tracemem(y)){return("Input vectors are identical")}
untracemem(x)
untracemem(y)
if(sum(x %in% y)>=length(x) & sum(y %in% x)>=length(y)){print("Something something.")}
return(sum(x,y))}
output:
a <- 1
b <- 2
dummyfunction3(a,a)
# [1] "Input vectors are identical"
dummyfunction3(a,b)
# [1] 3
Don't use tracemem(). Instead you could try pryr::address() which
just returns the memory address of the input.
devtools::install_github("hadley/pryr")
library(pryr)
x <- 1:10
y <- x
address(x)
## [1] "0x100a568c8"
address(y)
## [1] "0x100a568c8"
Using a basic function such as this:
myname<-function(z){
nm <-deparse(substitute(z))
print(nm)
}
I'd like the name of the item to be printed (or returned) when iterating through a list e.g.
for (csv in list(acsv, bcsv, ccsv)){
myname(csv)
}
should print:
acsv
bcsv
ccsv
(and not csv).
It should be noted that acsv, bcsv, and ccsv are all dataframes read in from csvs, i.e.
acsv = read.csv("a.csv")
bcsv = read.csv("b.csv")
ccsv = read.csv("c.csv")
Edit:
I ended up using a bit of a compromise. The primary goal of this was not to simply print the frame name - that was the question, because it is a prerequisite for doing other things.
I needed to run the same functions on four identically formatted files. I then used this syntax:
for(i in 1:length(csvs)){
cat(names(csvs[i]), "\n")
print(nrow(csvs[[i]]))
print(nrow(csvs[[i]][1]))
}
Then the indexing of nested lists was utilized e.g.
print(nrow(csvs[[i]]))
which shows the row count for each of the dataframes.
print(nrow(csvs[[i]][1]))
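which selects the first column of each dataframe before counting its rows. Note that csvs[[i]][1] is still a one-column dataframe, so nrow() reports the same row count again.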
I include this because it was the motivator for the question. I needed to be able to label the data for each dataframe being examined.
The list you have constructed doesn't "remember" the expressions it was constructed of anymore. But you can use a custom constructor:
named.list <- function(...) {
l <- list(...)
exprs <- lapply(substitute(list(...))[-1], deparse)
names(l) <- exprs
l
}
And so:
> named.list(1+2,sin(5),sqrt(3))
$`1 + 2`
[1] 3
$`sin(5)`
[1] -0.9589243
$`sqrt(3)`
[1] 1.732051
Use this list as parameter to names, as Thomas suggested:
> names(named.list(1+2,sin(5),sqrt(3)))
[1] "1 + 2" "sin(5)" "sqrt(3)"
To understand what's happening here, let's analyze the following:
> as.list(substitute(list(1+2,sqrt(5))))
[[1]]
list
[[2]]
1 + 2
[[3]]
sqrt(5)
The [-1] indexing leaves out the first element, and all remaining elements are passed to deparse, which works because of...
> lapply(as.list(substitute(list(1+2,sqrt(5))))[-1], class)
[[1]]
[1] "call"
[[2]]
[1] "call"
Note that you cannot "refactor" the call list(...) inside substitute() to use simply l. Do you see why?
I am also wondering if such a function is already available in one of the countless R packages around. I have found this post by William Dunlap effectively suggesting the same approach.
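For anyone stuck on the "do you see why?" question, here is a small sketch of the difference. The function names `named_list` and `values_only` are made up for this illustration. Inside a function, `substitute()` replaces an ordinary local variable such as `l` with its value, so the original expressions are already gone:

```r
# Works: substitute() sees the unevaluated arguments behind `...`.
named_list <- function(...) {
  l <- list(...)
  exprs <- as.list(substitute(list(...)))[-1]
  names(l) <- vapply(exprs, deparse, character(1))
  l
}

# Does not work: `l` is an ordinary local variable, so substitute(l)
# returns its value (the evaluated list), not the original expressions.
values_only <- function(...) {
  l <- list(...)
  vapply(as.list(substitute(l)), deparse, character(1))
}

names(named_list(1 + 2, sqrt(4)))
values_only(1 + 2, sqrt(4))
```

The first call gives the expression text as names, while the second only sees the computed values.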
I don't know what your data look like, so here's something made up:
csvs <- list(acsv=data.frame(x=1), bcsv=data.frame(x=2), ccsv=data.frame(x=3))
for(i in 1:length(csvs))
cat(names(csvs[i]), "\n")
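A slight variation on the same made-up data, as a sketch: looping over the names directly avoids the index, and `vapply()` collects one labelled value per data frame:

```r
# Made-up data frames standing in for the CSV files.
csvs <- list(acsv = data.frame(x = 1:2), bcsv = data.frame(x = 1:3))

# Loop over the names so each data frame can be labelled as it is examined.
for (nm in names(csvs)) {
  cat(nm, ":", nrow(csvs[[nm]]), "rows\n")
}

# The same information as a named vector.
row_counts <- vapply(csvs, nrow, integer(1))
row_counts
```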
I want to use information from a field and include it in an R function, e.g.:
data #name of the data.frame with only one row
"(if(nclusters>0){OptmizationInputs[3,3]*beta[1]}else{0})" # this is the row
If I want to use this information inside a function how could I do it?
Another example:
A=c('x^2')
B=function (x) A
B(2)
"x^2" # this is the return. I would like to have the return something like 2^2=4.
Use body<- and parse
A <- 'x^2'
B <- function(x) {}
body(B) <- parse(text = A)
B(3)
## [1] 9
There are more ideas here
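If this is needed for more than one string, the same steps can be wrapped in a small helper. This is only a sketch; `make_fun` is a made-up name, and it assumes the expression uses `x` as its only argument:

```r
# Build a one-argument function of x from a string holding an R expression.
make_fun <- function(expr_text) {
  f <- function(x) NULL
  # parse() returns an expression vector; [[1]] extracts the call itself.
  body(f) <- parse(text = expr_text)[[1]]
  f
}

square <- make_fun("x^2")
square(2)
```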
Another option using plyr:
A <- 'x^2'
library(plyr)
B <- function(x) {}
body(B) <- as.quoted(A)[[1]]
> B(5)
[1] 25
A <- "x^2"; x <- 2
BB <- function(z){ print( as.expression(do.call("substitute",
list( parse(text=A)[[1]], list(x=eval(x) ) )))[[1]] );
cat( "is equal to ", eval(parse(text=A)))
}
BB(2)
#2^2
#is equal to 4
Managing expressions in R is very weird. substitute refuses to evaluate its first argument so you need to use do.call to allow the evaluation to occur before the substitution. Furthermore the printed representation of the expressions hides their underlying representation. Try removing the fairly cryptic (to my way of thinking) [[1]] after the as.expression(.) result.
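A stripped-down sketch of that point, leaving out the printing (the variable names are only for illustration): calling `substitute()` directly on a variable that holds an expression just returns the variable's name, whereas going through `do.call()` evaluates the variable first, so the substitution happens inside the stored expression:

```r
expr <- parse(text = "x^2")[[1]]   # the call x^2

# substitute() does not evaluate its first argument, so this is just `expr`.
untouched <- substitute(expr, list(x = 2))

# do.call() evaluates `expr` first, so substitute() receives the call x^2.
filled <- do.call("substitute", list(expr, list(x = 2)))

deparse(filled)
eval(filled)
```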