suppose I have the following:
a <- vector('list',50)
for(i in 1:50)
{
a[[i]] <- list(path=paste0("file",sample(0:600,1)),contents=sample(1:5,10*i,replace=TRUE))
}
Now, for example, I want to retrieve the contents of file45 (assuming it exists in this randomly generated data) as fast as possible.
I have tried the following:
contents <- unlist(Filter(function(x) x$path=="file45",a),recursive=FALSE)$contents
However, the overhead of searching the list makes reading from memory somewhat slower than reading directly from disk.
Is there any other way of retrieving the contents that is reasonably faster than reading from disk, ideally O(1)?
Edit: assume that there are no duplicate file paths in my sublists and that there are far more than 50 sublists.
Use the names attribute to track the items instead:
a <- vector('list',50)
for(i in 1:50)
{
a[[i]] <- list(contents=sample(1:5,10*i,replace=TRUE))
}
names(a) <- paste0("file",sample(1:600,50))
a[["file45"]]
NULL
a[["file25"]]
$contents
[1] 3 1 3 1 2 5 1 5 1 2 3 1 4 1 1 4 1 5 1 5 1 4 5 2 5 2 2 5 1 1
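Note that name lookup on a list is still a linear scan internally, just a fast one at C level. For lookup that is O(1) on average, one option is to move the named list into an environment, which is hashed — a minimal sketch reusing the names above:
# environments are hash tables, so name lookup is O(1) on average
env <- list2env(a, envir = new.env(hash = TRUE))
env[["file25"]]$contents  # same values as a[["file25"]]$contents above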
Try the following:
a[sapply(a, function(x) x$path == "file45")][[1]]$contents
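If the data must stay in the original path-in-sublist layout, a slightly leaner variant of the same linear scan uses match, and returns NULL instead of erroring when the path is absent — a sketch:
paths <- vapply(a, function(x) x$path, character(1))
idx <- match("file45", paths)  # position of the first match, or NA
contents <- if (is.na(idx)) NULL else a[[idx]]$contents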
How do I intersect between multiple samples?
I have 29 lists of identifiers that I build by concatenating gene name, cc change, and coordinate. Each list is 400-800 elements long. I need to build a table showing how many variants are shared between each pair of lists, for all 812 combinations. Is there a way I can do this in R?
For example, if I have 4 lists:
A<-c("TSC22112517","SLC141T43309911","RAD51D33446609","WRN31024638")
B<-c("TSC22112517","SLC14A143309911","RHBDF274474996","WRN31024638")
C<-c("TSC22112517","SLC14A143309911","RAD51D33446609","MEN164575556")
D<-c("FANCM45665468","SLC14A143309911","RAD51D33446609","MEN164575556")
I just need to find how many variants are shared between each pair.
AB<-length(intersect(A,B))
gives me the # of variants shared by A and B, which is 3.
Then I can get a table like the one below showing the # of shared variants:
  A B C D
A 4 3 2 2
B 3 4 3 2
C 2 3 4 2
D 2 2 2 4
How do I do it for a large number of lists? I have 29 lists and each has 600 variants.
You could try something like this (I do a lot of things in lists...):
# x is your data in list() format: a list of character vectors (like A, B, C, D above)
shared <- list()
for (i in 1:29){
  shared[[i]] <- list()
  for (j in 1:29){
    if (i != j){
      # keep the variants of list i that also occur in list j
      shared[[i]][[j]] <- x[[i]][x[[i]] %in% x[[j]]]
    }
  }
}
I was so happy to figure this out:
x <- list()  # fill with your 29 variant vectors
shared <- matrix(NA_integer_, nrow = 29, ncol = 29)
temp <- numeric(29)
for (i in 1:29){
  for (j in 1:29){
    temp[j] <- length(intersect(x[[i]][[1]], x[[j]][[1]]))
  }
  shared[, i] <- temp
}
shared
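For what it's worth, the pair of nested loops can also be written without explicit loops. A sketch using the four example vectors from the question (the 29-list case works the same way, given a named list of variant vectors):
lst <- list(A = A, B = B, C = C, D = D)
# nested sapply builds the full matrix of pairwise intersection counts
shared <- sapply(lst, function(u) sapply(lst, function(v) length(intersect(u, v))))
shared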
I am trying to run simulation scenarios which should provide me with the best scenario for a given date, backtested over the last couple of months. Each scenario takes 4 input variables, each of which can be in one of 5 states (625 permutations). The flow of the model is as follows:
Simulate 625 scenarios to get each of their profit
Rank each of the scenarios according to their profit
Repeat the process through a 1-day expanding window for the last 2 months starting on the 1st Dec 2015 - creating a time series of ranks for each of the 625 scenarios
The unfortunate result of this is 5 nested for loops, which can take extremely long to run. I had a look at the foreach package, but I am concerned about how the combining of the outputs will work in my scenario.
The current code that I am using works as follows. First I create the possible states of each of the inputs, along with the backtesting window:
#backtesting window
a <- seq(as.Date("2015-12-01"), Sys.Date() - 1, by = "day")
#input variables
b <- seq(1, 5, 1)
c <- seq(1, 5, 1)
d <- seq(1, 5, 1)
e <- seq(1, 5, 1)
set.seed(3142)
tot_results <- NULL
Next the nested for loops proceed to run through the simulations for me.
for(i in 1:length(a))
{
  cat(paste0("\n", "Current estimation date: ", a[i]), "; iteration:", i, "\n")
  #subset the data for backtesting
  dataset_calc <- dataset[which(dataset$Date <= a[i]), ]
  p <- 1
  #pre-allocate one row per scenario (ID and profit)
  results <- data.frame(ID = rep(NA, 625), profit = rep(NA, 625))
  for(j in 1:length(b))
  {
    for(k in 1:length(c))
    {
      for(l in 1:length(d))
      {
        for(m in 1:length(e))
        {
          if(i == 1)
          {
            #create a unique ID to merge onto later
            unique_ID <- paste0(replicate(1, paste(sample(LETTERS, 5, replace = TRUE), collapse = "")),
                                round(runif(n = 1, min = 1, max = 1000000)))
          }
          #run the profit calculation
          post_sim_results <- profit_calc(dataset_calc, param1 = e[m], param2 = d[l], param3 = c[k], param4 = b[j])
          #extract the final profit amount
          profit <- round(post_sim_results[nrow(post_sim_results), ], 2)
          results[p, ] <- data.frame(unique_ID, profit)
          p <- p + 1
        }
      }
    }
  }
  #extract the ranks for all scenarios
  rank <- rank(results$profit)
  #bind the ranks for the expanding window
  if(i == 1)
  {
    tot_results <- data.frame(ID = results[, 1], rank)
  } else {
    tot_results <- cbind(tot_results, rank)
  }
  suppressMessages(gc())
}
My biggest concern is the binding of the results, given that the outer loop's actions depend on the output of the inner loops.
Any advice on how to proceed would be greatly appreciated.
So I think that you can vectorize most of this, which should give a big reduction in run time.
Currently, you use for-loops (5, to be exact) to create every combination of values, and then run the values one by one through profit_calc (a function that is not specified). Ideally, you'd just take all possible combinations in one go and push them through profit_calc in one single operation.
-- Rationale --
a <- 1:10
b <- 1:10
d <- rep(NA,10)
for (i in seq(a)) d[i] <- a[i] * b[i]
d
# [1] 1 4 9 16 25 36 49 64 81 100
Since * also works on vectors, we can rewrite this to:
a <- 1:10
b <- 1:10
d <- a*b
d
# [1] 1 4 9 16 25 36 49 64 81 100
While it may save us only one line of code, it actually reduces the problem from 10 steps to 1 step.
-- Application --
So how does that apply to your code? Well, given that we can vectorize profit_calc, you can basically generate a data frame where each row is every possible combination of your parameters. We can do this with expand.grid:
foo <- expand.grid(b,c,d,e)
head(foo)
# Var1 Var2 Var3 Var4
# 1 1 1 1 1
# 2 2 1 1 1
# 3 3 1 1 1
# 4 4 1 1 1
# 5 5 1 1 1
# 6 1 2 1 1
Let's say we have a formula... (a - b) * (c + d)... Then it would work like:
bar <- (foo[,1] - foo[,2]) * (foo[,3] + foo[,4])
head(bar)
# [1] 0 2 4 6 8 -2
So basically, try to find a way to replace for-loops with vectorized options. If you cannot vectorize something, try looking into apply instead, as that can also save you some time in most cases. If your code is running too slowly, ideally you'd first see if you can write a more efficient script. You may also be interested in the microbenchmark package, or ?system.time.
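Applied to the question's code, here is one hedged sketch. Assuming profit_calc (which is not specified) can only take scalar parameters, the four inner loops still collapse into expand.grid plus a single apply over the rows:
params <- expand.grid(b = b, c = c, d = d, e = e)
# one profit per parameter combination; profit_calc is the question's
# (unspecified) function, so this only sketches the call pattern
profits <- apply(params, 1, function(r) {
  res <- profit_calc(dataset_calc, param1 = r["e"], param2 = r["d"],
                     param3 = r["c"], param4 = r["b"])
  round(res[nrow(res), ], 2)
})
rank(profits)
The outer date loop would remain, but the 625-scenario body becomes a single step per date, and the ranks can be cbind-ed exactly as before.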
I got this text file:
a b c d
0 2 8 9
2 0 3 4
8 3 0 2
9 4 2 0
I put this command in R:
k<-read.table("d:/r/file.txt", header=TRUE)
Now I want to access the value in row 3, column 4 (which is 2). How can I access it?
Basically, my question is how to access the table data element by element; I want to use each value separately in nested for loops.
Like:
for(row=0;row<4;row++)
  for(col=0;col<4;col++)
    print data[row][col];
You may want to apply a certain operation to each element of a matrix. This is how you could do it; an example:
A <- matrix(1:16,4,4)
apply(A,c(1,2),function(x) {x %% 5})
And an operation on each whole row:
apply(A,1,function(x) sum(x^2))
Is this what you want?
test <- read.table("test.txt", header = TRUE, fill = TRUE)
for(i in 1:nrow(test)){
  for(j in 1:ncol(test)) {
    print(test[i, j])
  }
}
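For the single value the question asks about, plain [row, column] indexing is all that's needed:
k[3, 4]  # row 3, column 4 -> 2
k[3, ]   # the whole third row
k[, 4]   # the whole fourth column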
How do I assign to a dynamically created vector?
master<-c("bob","ed","frank")
d<-seq(1:10)
for (i in 1:length(master)){
assign(master[i], d )
}
eval(parse(text=master[2]))[2]  # I can access the data
# but how can I assign to it? The following returns an error:
eval(parse(text=master[2]))[2] <- 900
OK. I'll post this code but only because I was asked to:
> eval(parse(text=paste0( master[2], "[2]<- 900" ) ) )
> ed
[1] 1 900 3 4 5 6 7 8 9 10
It's generally considered bad practice to use such a method. You need to build the expression ed[2] <- 900 as a text string. Using paste0 lets you evaluate master[2] as "ed", which is then concatenated with the rest of the characters before being passed to parse, which converts the string to a language object. This would be more in keeping with what is considered better practice:
master<-c("bob","ed","frank")
d<-seq(1:10)
mlist <- setNames( lapply(seq_along(master), function(x) {d} ), master)
So changing the second value of the second item with <-:
> mlist[[2]][2] <- 900
> mlist[['ed']]
[1] 1 900 3 4 5 6 7 8 9 10
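If the separate variables really must exist (rather than living in a list), a parse-free route is get/modify/assign — a small sketch:
tmp <- get(master[2])    # fetch the vector currently named "ed"
tmp[2] <- 900            # modify the copy
assign(master[2], tmp)   # write it back under the same name
ed[2]
# [1] 900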
I have a dataframe with individuals assigned a text id that concatenates a place-name with a personal id (see data, below). Ultimately, I need to transform the data set from "long" to "wide" (e.g., using "reshape") so that each individual comprises only one row. In order to do that, I need to assign a "time" variable that reshape can use to identify time-varying covariates, etc. I have (probably bad) code to do this for individuals that repeat up to two times, but I need to be able to identify up to 18 repeated occurrences. The code below works fine if I remove the line preceded by the hash, but then it only identifies up to two repeats. If I leave that line in (which would seem necessary for individuals repeated more than twice), R chokes, giving the following error (presumably because the first individual is repeated only twice):
Error in if (data$uid[i] == data$uid[i - 2]) { :
argument is of length zero
Can anyone help with this? Thanks in advance!
place <- rep("ny",10)
pid <- c(1,1,2,2,2,3,4,4,5,5)
uid<- paste(place,pid,sep="")
time <- rep(0,10)
data <- cbind(uid,time)
data <- as.data.frame(data)
data$time <- as.numeric(data$time)
#bad code
data$time[1] <- 1 #need to set first so that loop doesn't go to a row that doesn't exist (i.e., row 0)
for (i in 2:NROW(data)){
data$time[i] <- 1 #set first occurrence to 1
if (data$uid[i] == data$uid[i-1]) {data$time[i] <- 2} #set second occurrence to 2, etc.
#if (data$uid[i] == data$uid[i-2]) {data$time[i] <- 3}
i <- i+1
}
It's unclear what you are trying to do, but I think you're saying that you need to create a time index for each row within each unique uid. Is that right?
If so, give this a whirl:
library(plyr)
ddply(data, "uid", transform, time = seq_along(uid))
Will give you something like:
uid time
1 ny1 1
2 ny1 2
3 ny2 1
4 ny2 2
5 ny2 3
....
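If plyr proves slow on large data, the equivalent grouped operation in data.table is usually much faster — a sketch, assuming the same data frame:
library(data.table)
setDT(data)[, time := seq_len(.N), by = uid]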
Is this what you have in mind?
> d <- data.frame(uid = paste("ny",c(1,2,1,2,2,3,4,4,5,5),sep=""))
> out <- do.call(rbind, lapply(split(d, d$uid), function(x) {x$time <- 1:nrow(x); x}))
> rownames(out) <- NULL
> out
uid time
1 ny1 1
2 ny1 2
3 ny2 1
4 ny2 2
5 ny2 3
6 ny3 1
7 ny4 1
8 ny4 2
9 ny5 1
10 ny5 2
Using your data frame setup:
place <- rep("ny",10)
pid <- c(1,1,2,2,2,3,4,4,5,5)
uid<- paste(place,pid,sep="")
time <- rep(0,10)
data <- cbind(uid,time)
data <- as.data.frame(data)
You can use:
data$time <- sequence(table(data$uid))
data
To get:
> data
uid time
1 ny1 1
2 ny1 2
3 ny2 1
4 ny2 2
5 ny2 3
6 ny3 1
7 ny4 1
8 ny4 2
9 ny5 1
10 ny5 2
NOTE: Your data.frame MUST be sorted by uid first for this to work.
After trying the above solutions on large data sets, I decided to write my own loop for this. It was very time-consuming, and the data still had to be broken into 50k-element vectors, but it did work in the end:
data$repeats <- 1  #initialise: every row starts as a first occurrence
system.time(
  for(i in 2:length(data$uid)) {
    if(data$uid[i] == data$uid[i-1]) data$repeats[i] <- data$repeats[i-1] + 1
    if ((i %% 1000) == 0) {  #helps to keep track of how far the loop has gotten
      print(i)
    }
  }
)
Thanks to all for your help.
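For what it's worth, a loop-free base-R alternative that should scale much better than the element-wise loop is ave, which numbers the rows within each uid group directly:
# within-group running count; unlike sequence(table(...)), no sorting required
data$time <- ave(seq_along(data$uid), data$uid, FUN = seq_along)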